|
---
language:
- kk
- ru
- tr
- en
library_name: transformers
extra_gated_prompt: 'Fill in the form below to access the model:'
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for: text
license: cc-by-nc-4.0
---
|
|
|
## Description

AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in Kazakh; the model also supports Russian, Turkish, and English.
|
|
|
## Evaluation Metrics |
|
|
|
Model evaluations were conducted on established benchmarks (MMLU, Winogrande, HellaSwag, ARC, GSM8k, and DROP), following a systematic process to test performance across cognitive and technical tasks in Kazakh, Russian, and English.
|
|
|
### Kazakh Leaderboard |
|
|
|
| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8k | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.812   | 0.720 | 0.852      | 0.820     | 0.940 | 0.880 | 0.660 |
| DeepSeek V3 37A     | 0.715   | 0.650 | 0.628      | 0.640     | 0.900 | 0.890 | 0.580 |
| DeepSeek R1         | 0.798   | 0.753 | 0.764      | 0.680     | 0.868 | 0.937 | 0.784 |
| Llama-3.1-70b-inst. | 0.639   | 0.610 | 0.585      | 0.520     | 0.820 | 0.780 | 0.520 |
| KazLLM-1.0-70B      | 0.766   | 0.660 | 0.806      | 0.790     | 0.920 | 0.770 | 0.650 |
| GPT-4o              | 0.776   | 0.730 | 0.704      | 0.830     | 0.940 | 0.900 | 0.550 |
| **AlemLLM**         | 0.826   | 0.757 | 0.837      | 0.775     | 0.949 | 0.917 | 0.719 |
| QwQ 32B             | 0.628   | 0.591 | 0.613      | 0.499     | 0.661 | 0.826 | 0.576 |
|
|
|
### Russian Leaderboard |
|
|
|
| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8k | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.834   | 0.750 | 0.854      | 0.870     | 0.960 | 0.890 | 0.680 |
| DeepSeek V3 37A     | 0.818   | 0.784 | 0.756      | 0.840     | 0.960 | 0.910 | 0.660 |
| DeepSeek R1         | 0.845   | 0.838 | 0.811      | 0.827     | 0.972 | 0.928 | 0.694 |
| Llama-3.1-70b-inst. | 0.752   | 0.660 | 0.691      | 0.730     | 0.920 | 0.880 | 0.630 |
| KazLLM-1.0-70B      | 0.748   | 0.650 | 0.806      | 0.860     | 0.790 | 0.810 | 0.570 |
| GPT-4o              | 0.808   | 0.776 | 0.771      | 0.880     | 0.960 | 0.890 | 0.570 |
| **AlemLLM**         | 0.848   | 0.801 | 0.858      | 0.843     | 0.959 | 0.896 | 0.729 |
| QwQ 32B             | 0.840   | 0.810 | 0.807      | 0.823     | 0.964 | 0.926 | 0.709 |
|
|
|
### English Leaderboard |
|
|
|
| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8k | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.909   | 0.820 | 0.936      | 0.930     | 0.980 | 0.930 | 0.860 |
| DeepSeek V3 37A     | 0.880   | 0.840 | 0.790      | 0.900     | 0.980 | 0.950 | 0.820 |
| DeepSeek R1         | 0.908   | 0.855 | 0.857      | 0.882     | 0.977 | 0.960 | 0.915 |
| Llama-3.1-70b-inst. | 0.841   | 0.770 | 0.718      | 0.880     | 0.960 | 0.900 | 0.820 |
| KazLLM-1.0-70B      | 0.855   | 0.820 | 0.843      | 0.920     | 0.970 | 0.820 | 0.760 |
| GPT-4o              | 0.862   | 0.830 | 0.793      | 0.940     | 0.980 | 0.910 | 0.720 |
| **AlemLLM**         | 0.921   | 0.874 | 0.928      | 0.909     | 0.978 | 0.926 | 0.911 |
| QwQ 32B             | 0.914   | 0.864 | 0.886      | 0.897     | 0.969 | 0.969 | 0.896 |
|
|
|
## Model Specification

**Architecture:** Mixture of Experts <br>
**Total Parameters:** 247B <br>
**Activated Parameters:** 22B <br>
**Tokenizer:** SentencePiece <br>
**Precision:** BF16 <br>
**Vocabulary Size:** 100352 <br>
**Number of Layers:** 56 <br>
**Activation Function:** SwiGLU <br>
**Positional Encoding Method:** RoPE <br>
**Optimizer:** AdamW <br>
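
For readers unfamiliar with SwiGLU, the sketch below shows the standard gated feed-forward formulation it refers to. This is an illustrative reference implementation with placeholder dimensions, not AlemLLM's actual layer configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Standard SwiGLU feed-forward block: down(SiLU(x @ W_gate) * (x @ W_up)).

    Illustrative only -- hidden/intermediate sizes are placeholders,
    not AlemLLM's actual dimensions.
    """
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (a.k.a. Swish) gates the up-projection element-wise.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example with placeholder dimensions:
block = SwiGLU(hidden_size=1024, intermediate_size=2816)
out = block(torch.randn(2, 8, 1024))  # (batch, seq_len, hidden)
```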
|
|
|
## Run in Docker mode |
|
|
|
Tested environment:

- Ubuntu 24.04
- NVIDIA-SMI 535.247.01
- Driver Version: 535.247.01
- CUDA Version: 12.2
|
|
|
```bash
docker run -it --runtime nvidia -d \
    --restart=unless-stopped \
    --gpus all \
    -e OMP_NUM_THREADS=1 \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    -p 8000:8000 \
    -v shm:/dev/shm \
    -v /alemllm/tmp/:/tmp \
    -v /alemllm/tmp/:/root/.cache \
    -v /alemllm/tmp/:/root/.local \
    -v /alemllm/weights:/alemllm/weights/ \
    astanahubcloud/alemllm:latest \
    python3 -m vllm.entrypoints.openai.api_server \
        --model=/alemllm/weights/ \
        --trust-remote-code \
        --tokenizer-mode=slow \
        --disable-log-requests \
        --max-seq-len-to-capture=131072 \
        --gpu-memory-utilization=0.98 \
        --tensor-parallel-size=8 \
        --port=8000 \
        --host=0.0.0.0 \
        --served-model-name astanahub/alemllm
```
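
Once the container is up, the vLLM server exposes an OpenAI-compatible API on port 8000. A minimal query sketch, assuming the server is reachable on localhost (adjust the host as needed):

```python
import requests

# vLLM's OpenAI-compatible chat completions endpoint;
# "localhost" is an assumption -- replace with your server's address.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "astanahub/alemllm",  # matches --served-model-name above
    "messages": [
        {"role": "user", "content": "Қазақстан туралы қысқаша айтып бер."}
    ],
    "max_tokens": 512,
}

response = requests.post(url, json=payload, timeout=300)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```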
|
|
|
## Run in Hugging Face mode

Tested environment:

- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model in its native precision and shard it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)

# Use the slow (SentencePiece) tokenizer, matching --tokenizer-mode=slow above.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]

# Render the chat template and append the generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in
    zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
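
To see tokens as they are produced rather than waiting for the full completion, a TextStreamer from transformers can be attached to the same generate call. A minimal variation of the snippet above:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated; skip echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    streamer=streamer,
)
```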
|
|
|
## Run in TuringInfer mode |
|
|
|
Tested environment:

- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0
|
|
|
```bash
python -m turing_serving.launcher \
    --model-path /path/to/alemllm \
    --model-name alemllm \
    --host 0.0.0.0 \
    --port 9528 \
    --solver server_solver \
    --backend vllm \
    --tensor-parallel-size 8 \
    --worker-timeout-seconds 7200 \
    --skip-authorization-check \
    --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu_memory_utilization=0.98
```
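
TuringInfer's request format is not documented here; since the launcher runs the vLLM backend, the sketch below assumes it proxies an OpenAI-compatible chat endpoint on the configured port. Verify the actual route against your TuringInfer deployment before relying on it:

```python
import requests

# Assumed OpenAI-compatible route -- TuringInfer may expose a different path.
url = "http://localhost:9528/v1/chat/completions"

payload = {
    "model": "alemllm",  # matches --model-name above
    "messages": [{"role": "user", "content": "Сәлем! Өзіңді таныстырып өтші."}],
    "max_tokens": 256,
}

print(requests.post(url, json=payload, timeout=300).json())
```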
|
|
|
## License |
|
|
|
The model is released under the CC BY-NC 4.0 license, which does not permit commercial use. For commercial licensing inquiries, please [contact us](https://astanahub.com/ru/contacts/).