---
language:
- kk
- ru
- tr
- en
library_name: transformers
extra_gated_prompt: 'Fill in the form below to access the model:'
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for: text
license: cc-by-nc-4.0
---

## Description

AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.

## Evaluation Metrics

The model was evaluated on established benchmarks that test performance across a range of knowledge, reasoning, and technical tasks.

### Kazakh Leaderboard

| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8K | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.812   | 0.720 | 0.852      | 0.820     | 0.940 | 0.880 | 0.660 |
| DeepSeek V3 37A     | 0.715   | 0.650 | 0.628      | 0.640     | 0.900 | 0.890 | 0.580 |
| DeepSeek R1         | 0.798   | 0.753 | 0.764      | 0.680     | 0.868 | 0.937 | 0.784 |
| Llama-3.1-70b-inst. | 0.639   | 0.610 | 0.585      | 0.520     | 0.820 | 0.780 | 0.520 |
| KazLLM-1.0-70B      | 0.766   | 0.660 | 0.806      | 0.790     | 0.920 | 0.770 | 0.650 |
| GPT-4o              | 0.776   | 0.730 | 0.704      | 0.830     | 0.940 | 0.900 | 0.550 |
| **AlemLLM**         | 0.826   | 0.757 | 0.837      | 0.775     | 0.949 | 0.917 | 0.719 |
| QwQ 32B             | 0.628   | 0.591 | 0.613      | 0.499     | 0.661 | 0.826 | 0.576 |

### Russian Leaderboard

| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8K | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.834   | 0.750 | 0.854      | 0.870     | 0.960 | 0.890 | 0.680 |
| DeepSeek V3 37A     | 0.818   | 0.784 | 0.756      | 0.840     | 0.960 | 0.910 | 0.660 |
| DeepSeek R1         | 0.845   | 0.838 | 0.811      | 0.827     | 0.972 | 0.928 | 0.694 |
| Llama-3.1-70b-inst. | 0.752   | 0.660 | 0.691      | 0.730     | 0.920 | 0.880 | 0.630 |
| KazLLM-1.0-70B      | 0.748   | 0.650 | 0.806      | 0.860     | 0.790 | 0.810 | 0.570 |
| GPT-4o              | 0.808   | 0.776 | 0.771      | 0.880     | 0.960 | 0.890 | 0.570 |
| **AlemLLM**         | 0.848   | 0.801 | 0.858      | 0.843     | 0.959 | 0.896 | 0.729 |
| QwQ 32B             | 0.840   | 0.810 | 0.807      | 0.823     | 0.964 | 0.926 | 0.709 |

### English Leaderboard

| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8K | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.909   | 0.820 | 0.936      | 0.930     | 0.980 | 0.930 | 0.860 |
| DeepSeek V3 37A     | 0.880   | 0.840 | 0.790      | 0.900     | 0.980 | 0.950 | 0.820 |
| DeepSeek R1         | 0.908   | 0.855 | 0.857      | 0.882     | 0.977 | 0.960 | 0.915 |
| Llama-3.1-70b-inst. | 0.841   | 0.770 | 0.718      | 0.880     | 0.960 | 0.900 | 0.820 |
| KazLLM-1.0-70B      | 0.855   | 0.820 | 0.843      | 0.920     | 0.970 | 0.820 | 0.760 |
| GPT-4o              | 0.862   | 0.830 | 0.793      | 0.940     | 0.980 | 0.910 | 0.720 |
| **AlemLLM**         | 0.921   | 0.874 | 0.928      | 0.909     | 0.978 | 0.926 | 0.911 |
| QwQ 32B             | 0.914   | 0.864 | 0.886      | 0.897     | 0.969 | 0.969 | 0.896 |

## Model specification

**Architecture:** Mixture of Experts
**Total Parameters:** 247B
**Activated Parameters:** 22B
**Tokenizer:** SentencePiece
**Quantization:** BF16
**Vocabulary Size:** 100352
**Number of Layers:** 56
**Activation Function:** SwiGLU (see the sketch after this list)
**Positional Encoding Method:** RoPE
**Optimizer:** AdamW
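
For readers unfamiliar with the activation listed above: SwiGLU gates a linear projection with a Swish (SiLU) branch inside the feed-forward block. Below is a minimal PyTorch sketch of the general formulation; the dimension and weight names are illustrative, not read from the model's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block with a SwiGLU activation: down(SiLU(gate(x)) * up(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # gated (SiLU) branch
        self.up = nn.Linear(d_model, d_ff, bias=False)    # linear branch
        self.down = nn.Linear(d_ff, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```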
## Run in Docker mode

- Ubuntu 24.04
- NVIDIA-SMI 535.247.01
- Driver Version: 535.247.01
- CUDA Version: 12.2

```bash
docker run -it --runtime nvidia -d \
  --restart=unless-stopped \
  --gpus all \
  -e OMP_NUM_THREADS=1 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -p 8000:8000 \
  -v shm:/dev/shm \
  -v /alemllm/tmp/:/tmp \
  -v /alemllm/tmp/:/root/.cache \
  -v /alemllm/tmp/:/root/.local \
  -v /alemllm/weights:/alemllm/weights/ \
  astanahubcloud/alemllm:latest \
  python3 -m vllm.entrypoints.openai.api_server \
  --model=/alemllm/weights/ \
  --trust-remote-code \
  --tokenizer-mode=slow \
  --disable-log-requests \
  --max-seq-len-to-capture=131072 \
  --gpu-memory-utilization=0.98 \
  --tensor-parallel-size=8 \
  --port=8000 \
  --host=0.0.0.0 \
  --served-model-name astanahub/alemllm
```

The container serves an OpenAI-compatible API on port 8000; an example client request is shown at the end of this card.

## Run in Hugging Face mode

- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model and tokenizer; the custom architecture requires trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Run in TuringInfer mode

- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0

```bash
python -m turing_serving.launcher \
  --model-path /path/to/alemllm \
  --model-name alemllm \
  --host 0.0.0.0 \
  --port 9528 \
  --solver server_solver \
  --backend vllm \
  --tensor-parallel-size 8 \
  --worker-timeout-seconds 7200 \
  --skip-authorization-check \
  --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu-memory-utilization=0.98
```

## License

Note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://astanahub.com/ru/contacts/).
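
## Example: querying the served model

Once the container from the Docker section is running, the vLLM server can be queried with any OpenAI-compatible client. Below is a minimal sketch using the `openai` Python package; the base URL and served model name come from the Docker command above, while the prompt and generation settings are purely illustrative.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not validate the key; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="astanahub/alemllm",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Қазақстан туралы қысқаша айтып берші."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```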