Description

AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.

Evaluation Metrics

The model was evaluated on established benchmarks (MMLU, Winogrande, HellaSwag, ARC, GSM8K, and DROP) in Kazakh, Russian, and English, covering general knowledge, commonsense reasoning, grade-school math, and reading comprehension.

Kazakh Leaderboard

| Model | Average | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.812 | 0.720 | 0.852 | 0.820 | 0.940 | 0.880 | 0.660 |
| DeepSeek V3 37A | 0.715 | 0.650 | 0.628 | 0.640 | 0.900 | 0.890 | 0.580 |
| DeepSeek R1 | 0.798 | 0.753 | 0.764 | 0.680 | 0.868 | 0.937 | 0.784 |
| Llama-3.1-70b-inst. | 0.639 | 0.610 | 0.585 | 0.520 | 0.820 | 0.780 | 0.520 |
| KazLLM-1.0-70B | 0.766 | 0.660 | 0.806 | 0.790 | 0.920 | 0.770 | 0.650 |
| GPT-4o | 0.776 | 0.730 | 0.704 | 0.830 | 0.940 | 0.900 | 0.550 |
| AlemLLM | 0.826 | 0.757 | 0.837 | 0.775 | 0.949 | 0.917 | 0.719 |
| QwQ 32B | 0.628 | 0.591 | 0.613 | 0.499 | 0.661 | 0.826 | 0.576 |

Russian Leaderboard

| Model | Average | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.834 | 0.750 | 0.854 | 0.870 | 0.960 | 0.890 | 0.680 |
| DeepSeek V3 37A | 0.818 | 0.784 | 0.756 | 0.840 | 0.960 | 0.910 | 0.660 |
| DeepSeek R1 | 0.845 | 0.838 | 0.811 | 0.827 | 0.972 | 0.928 | 0.694 |
| Llama-3.1-70b-inst. | 0.752 | 0.660 | 0.691 | 0.730 | 0.920 | 0.880 | 0.630 |
| KazLLM-1.0-70B | 0.748 | 0.650 | 0.806 | 0.860 | 0.790 | 0.810 | 0.570 |
| GPT-4o | 0.808 | 0.776 | 0.771 | 0.880 | 0.960 | 0.890 | 0.570 |
| AlemLLM | 0.848 | 0.801 | 0.858 | 0.843 | 0.959 | 0.896 | 0.729 |
| QwQ 32B | 0.840 | 0.810 | 0.807 | 0.823 | 0.964 | 0.926 | 0.709 |

English Leaderboard

| Model | Average | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.909 | 0.820 | 0.936 | 0.930 | 0.980 | 0.930 | 0.860 |
| DeepSeek V3 37A | 0.880 | 0.840 | 0.790 | 0.900 | 0.980 | 0.950 | 0.820 |
| DeepSeek R1 | 0.908 | 0.855 | 0.857 | 0.882 | 0.977 | 0.960 | 0.915 |
| Llama-3.1-70b-inst. | 0.841 | 0.770 | 0.718 | 0.880 | 0.960 | 0.900 | 0.820 |
| KazLLM-1.0-70B | 0.855 | 0.820 | 0.843 | 0.920 | 0.970 | 0.820 | 0.760 |
| GPT-4o | 0.862 | 0.830 | 0.793 | 0.940 | 0.980 | 0.910 | 0.720 |
| AlemLLM | 0.921 | 0.874 | 0.928 | 0.909 | 0.978 | 0.926 | 0.911 |
| QwQ 32B | 0.914 | 0.864 | 0.886 | 0.897 | 0.969 | 0.969 | 0.896 |
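
In all three leaderboards the Average column appears to be the unweighted mean of the six benchmark scores; a minimal check in Python against AlemLLM's Kazakh row:

kazakh_scores = {"MMLU": 0.757, "Winogrande": 0.837, "HellaSwag": 0.775,
                 "ARC": 0.949, "GSM8K": 0.917, "DROP": 0.719}
average = sum(kazakh_scores.values()) / len(kazakh_scores)
print(round(average, 3))   # 0.826, matching the reported Average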

Model specification

  • Architecture: Mixture of Experts
  • Total Parameters: 247B
  • Activated Parameters: 22B
  • Tokenizer: SentencePiece
  • Quantization: BF16
  • Vocabulary Size: 100352
  • Number of Layers: 56
  • Activation Function: SwiGLU
  • Positional Encoding Method: RoPE
  • Optimizer: AdamW
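
For rough capacity planning, BF16 stores two bytes per parameter, so the full 247B-parameter weight set occupies on the order of 500 GB before KV cache and activations; a back-of-the-envelope sketch (weights only, not an exact memory figure):

total_params = 247e9            # total parameters across all experts
bytes_per_param = 2             # BF16
weight_gb = total_params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights, ~{weight_gb / 8:.0f} GB per GPU at tensor-parallel-size 8")
# -> ~494 GB of weights, ~62 GB per GPU at tensor-parallel-size 8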

Run in Docker mode

  • Ubuntu 24.04
  • NVIDIA-SMI 535.247.01
  • Driver Version: 535.247.01
  • CUDA Version: 12.2
docker run -it --runtime nvidia -d \
  --restart=unless-stopped \
  --gpus all \
  -e OMP_NUM_THREADS=1 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -p 8000:8000 \
  -v shm:/dev/shm \
  -v /alemllm/tmp/:/tmp \
  -v /alemllm/tmp/:/root/.cache \
  -v /alemllm/tmp/:/root/.local \
  -v /alemllm/weights:/alemllm/weights/ \
  astanahubcloud/alemllm:latest \
  python3 -m vllm.entrypoints.openai.api_server \
  --model=/alemllm/weights/ \
  --trust-remote-code \
  --tokenizer-mode=slow \
  --disable-log-requests \
  --max-seq-len-to-capture=131072 \
  --gpu-memory-utilization=0.98 \
  --tensor-parallel-size=8 \
  --port=8000 \
  --host=0.0.0.0 \
  --served-model-name astanahub/alemllm
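
Once the container is up, vLLM serves an OpenAI-compatible API on port 8000. A minimal sketch of a chat request using the requests library (the localhost address and generation parameters are illustrative assumptions):

import requests

payload = {
    "model": "astanahub/alemllm",   # must match --served-model-name above
    "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
    "max_tokens": 512,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])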

Run in Hugging Face mode

  • Ubuntu 22.04
  • CUDA 12.1
  • Python 3.11
  • pytorch==2.1.0
  • transformers==4.40.1
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype="auto",
  device_map="auto",
  rope_scaling=None,
  trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
prompt = "Give me a short introduction to large language models."

messages = [
  {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
  **model_inputs,
  max_new_tokens=16384
)

generated_ids = [
  output_ids[len(input_ids):] for input_ids, output_ids in
  zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
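
Decoding behavior can be adjusted through the standard transformers generation arguments; the values below are illustrative, not tuned recommendations:

generated_ids = model.generate(
  **model_inputs,
  max_new_tokens=1024,
  do_sample=True,      # sample instead of greedy decoding
  temperature=0.7,
  top_p=0.9,
)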

Run in TuringInfer mode

  • Ubuntu 22.04
  • CUDA 12.4
  • pytorch==2.6.0
  • transformers==4.51.0
python -m turing_serving.launcher \
  --model-path /path/to/alemllm \
  --model-name alemllm \
  --host 0.0.0.0 \
  --port 9528 \
  --solver server_solver \
  --backend vllm \
  --tensor-parallel-size 8 \
  --worker-timeout-seconds 7200 \
  --skip-authorization-check \
  --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu_memory_utilization=0.98

License

Note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to contact us.
