---
language:
- kk
- ru
- tr
- en
library_name: transformers
extra_gated_prompt: 'Fill in the form below to access the model:'
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for: text
license: cc-by-nc-4.0
---
## Description
AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.
## Evaluation Metrics
Model evaluations were conducted on established benchmarks (MMLU, WinoGrande, HellaSwag, ARC, GSM8K, and DROP) in Kazakh, Russian, and English, testing performance across a range of cognitive and technical tasks.
### Kazakh Leaderboard
| Model | Average | MMLU | WinoGrande | HellaSwag | ARC | GSM8K | DROP |
|----------------------------|---------|---------|---------------|--------------|--------|----------|---------|
| Yi-Lightning | 0.812 | 0.720 | 0.852 | 0.820 | 0.940 | 0.880 | 0.660 |
| DeepSeek V3 37A | 0.715 | 0.650 | 0.628 | 0.640 | 0.900 | 0.890 | 0.580 |
| DeepSeek R1 | 0.798 | 0.753 | 0.764 | 0.680 | 0.868 | 0.937 | 0.784 |
| Llama-3.1-70b-inst. | 0.639 | 0.610 | 0.585 | 0.520 | 0.820 | 0.780 | 0.520 |
| KazLLM-1.0-70B | 0.766 | 0.660 | 0.806 | 0.790 | 0.920 | 0.770 | 0.650 |
| GPT-4o | 0.776 | 0.730 | 0.704 | 0.830 | 0.940 | 0.900 | 0.550 |
| **AlemLLM** | 0.826 | 0.757 | 0.837 | 0.775 | 0.949 | 0.917 | 0.719 |
| QwQ 32B | 0.628 | 0.591 | 0.613 | 0.499 | 0.661 | 0.826 | 0.576 |
### Russian Leaderboard
| Model | Average | MMLU | WinoGrande | HellaSwag | ARC | GSM8K | DROP |
|----------------------------|---------|---------|---------------|--------------|--------|----------|---------|
| Yi-Lightning | 0.834 | 0.750 | 0.854 | 0.870 | 0.960 | 0.890 | 0.680 |
| DeepSeek V3 37A | 0.818 | 0.784 | 0.756 | 0.840 | 0.960 | 0.910 | 0.660 |
| DeepSeek R1 | 0.845 | 0.838 | 0.811 | 0.827 | 0.972 | 0.928 | 0.694 |
| Llama-3.1-70b-inst. | 0.752 | 0.660 | 0.691 | 0.730 | 0.920 | 0.880 | 0.630 |
| KazLLM-1.0-70B | 0.748 | 0.650 | 0.806 | 0.860 | 0.790 | 0.810 | 0.570 |
| GPT-4o | 0.808 | 0.776 | 0.771 | 0.880 | 0.960 | 0.890 | 0.570 |
| **AlemLLM** | 0.848 | 0.801 | 0.858 | 0.843 | 0.959 | 0.896 | 0.729 |
| QwQ 32B | 0.840 | 0.810 | 0.807 | 0.823 | 0.964 | 0.926 | 0.709 |
### English Leaderboard
| Model | Average | MMLU | WinoGrande | HellaSwag | ARC | GSM8K | DROP |
|----------------------------|---------|---------|---------------|--------------|--------|----------|---------|
| Yi-Lightning | 0.909 | 0.820 | 0.936 | 0.930 | 0.980 | 0.930 | 0.860 |
| DeepSeek V3 37A | 0.880 | 0.840 | 0.790 | 0.900 | 0.980 | 0.950 | 0.820 |
| DeepSeek R1 | 0.908 | 0.855 | 0.857 | 0.882 | 0.977 | 0.960 | 0.915 |
| Llama-3.1-70b-inst. | 0.841 | 0.770 | 0.718 | 0.880 | 0.960 | 0.900 | 0.820 |
| KazLLM-1.0-70B | 0.855 | 0.820 | 0.843 | 0.920 | 0.970 | 0.820 | 0.760 |
| GPT-4o | 0.862 | 0.830 | 0.793 | 0.940 | 0.980 | 0.910 | 0.720 |
| **AlemLLM** | 0.921 | 0.874 | 0.928 | 0.909 | 0.978 | 0.926 | 0.911 |
| QwQ 32B | 0.914 | 0.864 | 0.886 | 0.897 | 0.969 | 0.969 | 0.896 |
## Model specification
**Architecture:** Mixture of Experts <br>
**Total Parameters:** 247B <br>
**Activated Parameters:** 22B <br>
**Tokenizer:** SentencePiece <br>
**Quantization:** BF16 <br>
**Vocabulary Size:** 100352 <br>
**Number of Layers:** 56 <br>
**Activation Function:** SwiGLU (see the sketch below) <br>
**Positional Encoding Method:** RoPE <br>
**Optimizer:** AdamW <br>
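For reference, SwiGLU gates one linear projection with a SiLU-activated second projection before projecting back down. A minimal PyTorch sketch of the pattern (illustrative only, not the model's actual module; the dimensions are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated projection, then project back to the model dimension.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```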
## Run in Docker mode
- Ubuntu 24.04
- NVIDIA driver 535.247.01
- CUDA 12.2
```bash
docker run -it --runtime nvidia -d \
--restart=unless-stopped \
--gpus all \
-e OMP_NUM_THREADS=1 \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-p 8000:8000 \
-v shm:/dev/shm \
-v /alemllm/tmp/:/tmp \
-v /alemllm/tmp/:/root/.cache \
-v /alemllm/tmp/:/root/.local \
-v /alemllm/weights:/alemllm/weights/ \
astanahubcloud/alemllm:latest \
python3 -m vllm.entrypoints.openai.api_server \
--model=/alemllm/weights/ \
--trust-remote-code \
--tokenizer-mode=slow \
--disable-log-requests \
--max-seq-len-to-capture=131072 \
--gpu-memory-utilization=0.98 \
--tensor-parallel-size=8 \
--port=8000 \
--host=0.0.0.0 \
--served-model-name astanahub/alemllm
```
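Once the container is up, vLLM serves an OpenAI-compatible API on port 8000. A quick smoke test with `curl` (the prompt is just an illustration):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "astanahub/alemllm",
        "messages": [{"role": "user", "content": "Қазақстан туралы қысқаша айтып бер."}],
        "max_tokens": 256
      }'
```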
## Run in Hugging Face mode
- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model with automatic dtype selection and device placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)
# The model uses a SentencePiece tokenizer, so the slow tokenizer is required.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in
    zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
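The snippet above decodes greedily by default. To sample instead, pass the standard Transformers generation arguments to `model.generate` (the values below are illustrative, not tuned recommendations):

```python
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048,
    do_sample=True,    # enable stochastic sampling
    temperature=0.7,   # illustrative value
    top_p=0.9,         # nucleus sampling cutoff
)
```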
## Run in TuringInfer mode
- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0
```bash
python -m turing_serving.launcher \
--model-path /path/to/alemllm \
--model-name alemllm \
--host 0.0.0.0 \
--port 9528 \
--solver server_solver \
--backend vllm \
--tensor-parallel-size 8 \
--worker-timeout-seconds 7200 \
--skip-authorization-check \
--engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu-memory-utilization=0.98
```
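Since the launcher runs with `--backend vllm`, the server should expose the same OpenAI-compatible routes on port 9528. A sketch, assuming the chat-completions endpoint is passed through unchanged (verify against your TuringInfer deployment):

```bash
curl http://localhost:9528/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "alemllm", "messages": [{"role": "user", "content": "Сәлем!"}]}'
```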
## License
Note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://astanahub.com/ru/contacts/).