---
language:
- kk
- ru
- tr
- en
library_name: transformers
extra_gated_prompt: 'Fill in the form below to access the model:'
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for: text
license: cc-by-nc-4.0
---

## Description

AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.

## Evaluation Metrics

The model was evaluated on established benchmarks that test performance across a range of knowledge, reasoning, and technical tasks.

### Kazakh Leaderboard

| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8K | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.812   | 0.720 | 0.852      | 0.820     | 0.940 | 0.880 | 0.660 |
| DeepSeek V3 37A     | 0.715   | 0.650 | 0.628      | 0.640     | 0.900 | 0.890 | 0.580 |
| DeepSeek R1         | 0.798   | 0.753 | 0.764      | 0.680     | 0.868 | 0.937 | 0.784 |
| Llama-3.1-70b-inst. | 0.639   | 0.610 | 0.585      | 0.520     | 0.820 | 0.780 | 0.520 |
| KazLLM-1.0-70B      | 0.766   | 0.660 | 0.806      | 0.790     | 0.920 | 0.770 | 0.650 |
| GPT-4o              | 0.776   | 0.730 | 0.704      | 0.830     | 0.940 | 0.900 | 0.550 |
| **AlemLLM**         | 0.826   | 0.757 | 0.837      | 0.775     | 0.949 | 0.917 | 0.719 |
| QwQ 32B             | 0.628   | 0.591 | 0.613      | 0.499     | 0.661 | 0.826 | 0.576 |

### Russian Leaderboard

| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8K | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.834   | 0.750 | 0.854      | 0.870     | 0.960 | 0.890 | 0.680 |
| DeepSeek V3 37A     | 0.818   | 0.784 | 0.756      | 0.840     | 0.960 | 0.910 | 0.660 |
| DeepSeek R1         | 0.845   | 0.838 | 0.811      | 0.827     | 0.972 | 0.928 | 0.694 |
| Llama-3.1-70b-inst. | 0.752   | 0.660 | 0.691      | 0.730     | 0.920 | 0.880 | 0.630 |
| KazLLM-1.0-70B      | 0.748   | 0.650 | 0.806      | 0.860     | 0.790 | 0.810 | 0.570 |
| GPT-4o              | 0.808   | 0.776 | 0.771      | 0.880     | 0.960 | 0.890 | 0.570 |
| **AlemLLM**         | 0.848   | 0.801 | 0.858      | 0.843     | 0.959 | 0.896 | 0.729 |
| QwQ 32B             | 0.840   | 0.810 | 0.807      | 0.823     | 0.964 | 0.926 | 0.709 |

### English Leaderboard

| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8K | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.909   | 0.820 | 0.936      | 0.930     | 0.980 | 0.930 | 0.860 |
| DeepSeek V3 37A     | 0.880   | 0.840 | 0.790      | 0.900     | 0.980 | 0.950 | 0.820 |
| DeepSeek R1         | 0.908   | 0.855 | 0.857      | 0.882     | 0.977 | 0.960 | 0.915 |
| Llama-3.1-70b-inst. | 0.841   | 0.770 | 0.718      | 0.880     | 0.960 | 0.900 | 0.820 |
| KazLLM-1.0-70B      | 0.855   | 0.820 | 0.843      | 0.920     | 0.970 | 0.820 | 0.760 |
| GPT-4o              | 0.862   | 0.830 | 0.793      | 0.940     | 0.980 | 0.910 | 0.720 |
| **AlemLLM**         | 0.921   | 0.874 | 0.928      | 0.909     | 0.978 | 0.926 | 0.911 |
| QwQ 32B             | 0.914   | 0.864 | 0.886      | 0.897     | 0.969 | 0.969 | 0.896 |

## Model specification

**Architecture:** Mixture of Experts
**Total Parameters:** 247B
**Activated Parameters:** 22B
**Tokenizer:** SentencePiece
**Quantization:** BF16
**Vocabulary Size:** 100352
**Number of Layers:** 56
**Activation Function:** SwiGLU (see the sketch after this list)
**Positional Encoding Method:** RoPE
**Optimizer:** AdamW
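
For readers unfamiliar with the activation listed above: SwiGLU gates a linear projection with a Swish (SiLU) branch inside the feed-forward block. Below is a minimal PyTorch sketch of the general formulation; the dimension and weight names are illustrative, not read from the model's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block with a SwiGLU activation: down(SiLU(gate(x)) * up(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # gated (SiLU) branch
        self.up = nn.Linear(d_model, d_ff, bias=False)    # linear branch
        self.down = nn.Linear(d_ff, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```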
## Run in Docker mode

- Ubuntu 24.04
- NVIDIA-SMI 535.247.01
- Driver Version: 535.247.01
- CUDA Version: 12.2

```bash
docker run -it --runtime nvidia -d \
  --restart=unless-stopped \
  --gpus all \
  -e OMP_NUM_THREADS=1 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -p 8000:8000 \
  -v shm:/dev/shm \
  -v /alemllm/tmp/:/tmp \
  -v /alemllm/tmp/:/root/.cache \
  -v /alemllm/tmp/:/root/.local \
  -v /alemllm/weights:/alemllm/weights/ \
  astanahubcloud/alemllm:latest \
  python3 -m vllm.entrypoints.openai.api_server \
  --model=/alemllm/weights/ \
  --trust-remote-code \
  --tokenizer-mode=slow \
  --disable-log-requests \
  --max-seq-len-to-capture=131072 \
  --gpu-memory-utilization=0.98 \
  --tensor-parallel-size=8 \
  --port=8000 \
  --host=0.0.0.0 \
  --served-model-name astanahub/alemllm
```

The container serves an OpenAI-compatible API on port 8000; an example client request is shown at the end of this card.

## Run in Hugging Face mode

- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model and tokenizer; the custom architecture requires trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Run in TuringInfer mode

- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0

```bash
python -m turing_serving.launcher \
  --model-path /path/to/alemllm \
  --model-name alemllm \
  --host 0.0.0.0 \
  --port 9528 \
  --solver server_solver \
  --backend vllm \
  --tensor-parallel-size 8 \
  --worker-timeout-seconds 7200 \
  --skip-authorization-check \
  --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu-memory-utilization=0.98
```

## License

Note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://astanahub.com/ru/contacts/).
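
## Example: querying the served model

Once the container from the Docker section is running, the vLLM server can be queried with any OpenAI-compatible client. Below is a minimal sketch using the `openai` Python package; the base URL and served model name come from the Docker command above, while the prompt and generation settings are purely illustrative.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not validate the key; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="astanahub/alemllm",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Қазақстан туралы қысқаша айтып берші."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```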