Description
AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.
Evaluation Metrics
Models were evaluated on established benchmarks (MMLU, Winogrande, Hellaswag, ARC, GSM8k, and DROP) in Kazakh, Russian, and English, covering knowledge, commonsense reasoning, mathematics, and reading comprehension.
Kazakh Leaderboard
Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP |
---|---|---|---|---|---|---|---|
Yi-Lightning | 0.812 | 0.720 | 0.852 | 0.820 | 0.940 | 0.880 | 0.660 |
DeepSeek V3 37A | 0.715 | 0.650 | 0.628 | 0.640 | 0.900 | 0.890 | 0.580 |
DeepSeek R1 | 0.798 | 0.753 | 0.764 | 0.680 | 0.868 | 0.937 | 0.784 |
Llama-3.1-70b-inst. | 0.639 | 0.610 | 0.585 | 0.520 | 0.820 | 0.780 | 0.520 |
KazLLM-1.0-70B | 0.766 | 0.660 | 0.806 | 0.790 | 0.920 | 0.770 | 0.650 |
GPT-4o | 0.776 | 0.730 | 0.704 | 0.830 | 0.940 | 0.900 | 0.550 |
AlemLLM | 0.826 | 0.757 | 0.837 | 0.775 | 0.949 | 0.917 | 0.719 |
QwQ 32B | 0.628 | 0.591 | 0.613 | 0.499 | 0.661 | 0.826 | 0.576 |
Russian Leaderboard
Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP |
---|---|---|---|---|---|---|---|
Yi-Lightning | 0.834 | 0.750 | 0.854 | 0.870 | 0.960 | 0.890 | 0.680 |
DeepSeek V3 37A | 0.818 | 0.784 | 0.756 | 0.840 | 0.960 | 0.910 | 0.660 |
DeepSeek R1 | 0.845 | 0.838 | 0.811 | 0.827 | 0.972 | 0.928 | 0.694 |
Llama-3.1-70b-inst. | 0.752 | 0.660 | 0.691 | 0.730 | 0.920 | 0.880 | 0.630 |
KazLLM-1.0-70B | 0.748 | 0.650 | 0.806 | 0.860 | 0.790 | 0.810 | 0.570 |
GPT-4o | 0.808 | 0.776 | 0.771 | 0.880 | 0.960 | 0.890 | 0.570 |
AlemLLM | 0.848 | 0.801 | 0.858 | 0.843 | 0.959 | 0.896 | 0.729 |
QwQ 32B | 0.840 | 0.810 | 0.807 | 0.823 | 0.964 | 0.926 | 0.709 |
English Leaderboard
Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP |
---|---|---|---|---|---|---|---|
Yi-Lightning | 0.909 | 0.820 | 0.936 | 0.930 | 0.980 | 0.930 | 0.860 |
DeepSeek V3 37A | 0.880 | 0.840 | 0.790 | 0.900 | 0.980 | 0.950 | 0.820 |
DeepSeek R1 | 0.908 | 0.855 | 0.857 | 0.882 | 0.977 | 0.960 | 0.915 |
Llama-3.1-70b-inst. | 0.841 | 0.770 | 0.718 | 0.880 | 0.960 | 0.900 | 0.820 |
KazLLM-1.0-70B | 0.855 | 0.820 | 0.843 | 0.920 | 0.970 | 0.820 | 0.760 |
GPT-4o | 0.862 | 0.830 | 0.793 | 0.940 | 0.980 | 0.910 | 0.720 |
AlemLLM | 0.921 | 0.874 | 0.928 | 0.909 | 0.978 | 0.926 | 0.911 |
QwQ 32B | 0.914 | 0.864 | 0.886 | 0.897 | 0.969 | 0.969 | 0.896 |
Model specification
- Architecture: Mixture of Experts
- Total Parameters: 247B
- Activated Parameters: 22B
- Tokenizer: SentencePiece
- Precision: BF16
- Vocabulary Size: 100352
- Number of Layers: 56
- Activation Function: SwiGLU
- Positional Encoding Method: RoPE
- Optimizer: AdamW
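As a quick sanity check, the shipped configuration can be inspected without loading any weights. This is a minimal sketch, assuming the config exposes the usual transformers field names (`vocab_size`, `num_hidden_layers`); the model's custom config may name them differently.

```python
from transformers import AutoConfig

# Load only the configuration; trust_remote_code is needed because
# the model ships custom modeling code.
config = AutoConfig.from_pretrained("/path/to/alemllm", trust_remote_code=True)

# Field names below follow common transformers conventions (an assumption);
# compare the printed values against the specification listed above.
print("vocab size:", getattr(config, "vocab_size", "n/a"))       # expected 100352
print("layers:", getattr(config, "num_hidden_layers", "n/a"))    # expected 56
```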
Run in Docker mode
- Ubuntu 24.04
- NVIDIA-SMI 535.247.01
- Driver Version: 535.247.01
- CUDA Version: 12.2
```bash
docker run -it --runtime nvidia -d \
  --restart=unless-stopped \
  --gpus all \
  -e OMP_NUM_THREADS=1 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -p 8000:8000 \
  -v shm:/dev/shm \
  -v /alemllm/tmp/:/tmp \
  -v /alemllm/tmp/:/root/.cache \
  -v /alemllm/tmp/:/root/.local \
  -v /alemllm/weights:/alemllm/weights/ \
  astanahubcloud/alemllm:latest \
  python3 -m vllm.entrypoints.openai.api_server \
    --model=/alemllm/weights/ \
    --trust-remote-code \
    --tokenizer-mode=slow \
    --disable-log-requests \
    --max-seq-len-to-capture=131072 \
    --gpu-memory-utilization=0.98 \
    --tensor-parallel-size=8 \
    --port=8000 \
    --host=0.0.0.0 \
    --served-model-name astanahub/alemllm
```
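Once the container is up, it serves vLLM's OpenAI-compatible API on port 8000. The sketch below queries it with Python's `requests`; the endpoint path and payload follow the standard vLLM OpenAI-compatible server, and the prompt text is illustrative.

```python
import requests

# Query the OpenAI-compatible chat endpoint exposed by the vLLM server.
# The model name must match --served-model-name from the docker command above.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "astanahub/alemllm",
        "messages": [{"role": "user", "content": "Introduce yourself briefly."}],
        "max_tokens": 256,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```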
Run in Hugging Face mode
- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model with automatic dtype selection and device placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Build a chat-formatted prompt with the model's chat template.
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)

# Strip the prompt tokens so only the completion is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
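For interactive use, the same setup can stream tokens as they are generated with transformers' `TextStreamer`. This optional variation reuses the `model`, `tokenizer`, and `model_inputs` objects from the example above.

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as generation proceeds,
# skipping the echoed prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```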
Run in TuringInfer mode
- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0
```bash
python -m turing_serving.launcher \
  --model-path /path/to/alemllm \
  --model-name alemllm \
  --host 0.0.0.0 \
  --port 9528 \
  --solver server_solver \
  --backend vllm \
  --tensor-parallel-size 8 \
  --worker-timeout-seconds 7200 \
  --skip-authorization-check \
  --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu-memory-utilization=0.98
```
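Since the launcher runs vLLM as its backend, the server presumably exposes an OpenAI-style endpoint on port 9528. The route below is an assumption, not a documented TuringInfer API; consult the TuringInfer documentation for the actual interface.

```python
import requests

# Hypothetical request against the TuringInfer server started above.
# The /v1/chat/completions route is an assumption based on the vLLM backend;
# the actual path depends on how server_solver exposes the engine.
resp = requests.post(
    "http://localhost:9528/v1/chat/completions",
    json={
        "model": "alemllm",
        "messages": [{"role": "user", "content": "Sälem!"}],
        "max_tokens": 128,
    },
    timeout=300,
)
print(resp.json())
```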
License
Note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to contact us.