|
---
language:
- kk
- ru
- tr
- en
library_name: transformers
extra_gated_prompt: 'Fill in the form below to access the model:'
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for: text
license: cc-by-nc-4.0
---
|
|
|
## Description

AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in Kazakh; the model also supports Russian, Turkish, and English.
|
|
|
## Evaluation Metrics |
|
|
|
Model evaluations were conducted on established benchmarks (MMLU, Winogrande, HellaSwag, ARC, GSM8k, and DROP), following a systematic process to test performance across cognitive and technical tasks in Kazakh, Russian, and English.
|
|
|
### Kazakh Leaderboard |
|
|
|
| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8k | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.812   | 0.720 | 0.852      | 0.820     | 0.940 | 0.880 | 0.660 |
| DeepSeek V3 37A     | 0.715   | 0.650 | 0.628      | 0.640     | 0.900 | 0.890 | 0.580 |
| DeepSeek R1         | 0.798   | 0.753 | 0.764      | 0.680     | 0.868 | 0.937 | 0.784 |
| Llama-3.1-70b-inst. | 0.639   | 0.610 | 0.585      | 0.520     | 0.820 | 0.780 | 0.520 |
| KazLLM-1.0-70B      | 0.766   | 0.660 | 0.806      | 0.790     | 0.920 | 0.770 | 0.650 |
| GPT-4o              | 0.776   | 0.730 | 0.704      | 0.830     | 0.940 | 0.900 | 0.550 |
| **AlemLLM**         | 0.826   | 0.757 | 0.837      | 0.775     | 0.949 | 0.917 | 0.719 |
| QwQ 32B             | 0.628   | 0.591 | 0.613      | 0.499     | 0.661 | 0.826 | 0.576 |
|
|
|
### Russian Leaderboard |
|
|
|
| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8k | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.834   | 0.750 | 0.854      | 0.870     | 0.960 | 0.890 | 0.680 |
| DeepSeek V3 37A     | 0.818   | 0.784 | 0.756      | 0.840     | 0.960 | 0.910 | 0.660 |
| DeepSeek R1         | 0.845   | 0.838 | 0.811      | 0.827     | 0.972 | 0.928 | 0.694 |
| Llama-3.1-70b-inst. | 0.752   | 0.660 | 0.691      | 0.730     | 0.920 | 0.880 | 0.630 |
| KazLLM-1.0-70B      | 0.748   | 0.650 | 0.806      | 0.860     | 0.790 | 0.810 | 0.570 |
| GPT-4o              | 0.808   | 0.776 | 0.771      | 0.880     | 0.960 | 0.890 | 0.570 |
| **AlemLLM**         | 0.848   | 0.801 | 0.858      | 0.843     | 0.959 | 0.896 | 0.729 |
| QwQ 32B             | 0.840   | 0.810 | 0.807      | 0.823     | 0.964 | 0.926 | 0.709 |
|
|
|
### English Leaderboard |
|
|
|
| Model               | Average | MMLU  | Winogrande | HellaSwag | ARC   | GSM8k | DROP  |
|---------------------|---------|-------|------------|-----------|-------|-------|-------|
| Yi-Lightning        | 0.909   | 0.820 | 0.936      | 0.930     | 0.980 | 0.930 | 0.860 |
| DeepSeek V3 37A     | 0.880   | 0.840 | 0.790      | 0.900     | 0.980 | 0.950 | 0.820 |
| DeepSeek R1         | 0.908   | 0.855 | 0.857      | 0.882     | 0.977 | 0.960 | 0.915 |
| Llama-3.1-70b-inst. | 0.841   | 0.770 | 0.718      | 0.880     | 0.960 | 0.900 | 0.820 |
| KazLLM-1.0-70B      | 0.855   | 0.820 | 0.843      | 0.920     | 0.970 | 0.820 | 0.760 |
| GPT-4o              | 0.862   | 0.830 | 0.793      | 0.940     | 0.980 | 0.910 | 0.720 |
| **AlemLLM**         | 0.921   | 0.874 | 0.928      | 0.909     | 0.978 | 0.926 | 0.911 |
| QwQ 32B             | 0.914   | 0.864 | 0.886      | 0.897     | 0.969 | 0.969 | 0.896 |
|
|
|
## Model Specification

**Architecture:** Mixture of Experts <br>
**Total Parameters:** 247B <br>
**Activated Parameters:** 22B <br>
**Tokenizer:** SentencePiece <br>
**Precision:** BF16 <br>
**Vocabulary Size:** 100352 <br>
**Number of Layers:** 56 <br>
**Activation Function:** SwiGLU <br>
**Positional Encoding Method:** RoPE <br>
**Optimizer:** AdamW <br>
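
For readers unfamiliar with SwiGLU, the sketch below shows the standard gated feed-forward formulation it refers to. This is an illustrative reference implementation with placeholder dimensions, not AlemLLM's actual layer configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Standard SwiGLU feed-forward block: down(SiLU(x @ W_gate) * (x @ W_up)).

    Illustrative only -- hidden/intermediate sizes are placeholders,
    not AlemLLM's actual dimensions.
    """
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (a.k.a. Swish) gates the up-projection element-wise.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example with placeholder dimensions:
block = SwiGLU(hidden_size=1024, intermediate_size=2816)
out = block(torch.randn(2, 8, 1024))  # (batch, seq_len, hidden)
```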
|
|
|
## Run in Docker mode |
|
|
|
Tested environment:

- Ubuntu 24.04
- NVIDIA-SMI 535.247.01
- Driver Version: 535.247.01
- CUDA Version: 12.2
|
|
|
```bash
docker run -it --runtime nvidia -d \
    --restart=unless-stopped \
    --gpus all \
    -e OMP_NUM_THREADS=1 \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    -p 8000:8000 \
    -v shm:/dev/shm \
    -v /alemllm/tmp/:/tmp \
    -v /alemllm/tmp/:/root/.cache \
    -v /alemllm/tmp/:/root/.local \
    -v /alemllm/weights:/alemllm/weights/ \
    astanahubcloud/alemllm:latest \
    python3 -m vllm.entrypoints.openai.api_server \
        --model=/alemllm/weights/ \
        --trust-remote-code \
        --tokenizer-mode=slow \
        --disable-log-requests \
        --max-seq-len-to-capture=131072 \
        --gpu-memory-utilization=0.98 \
        --tensor-parallel-size=8 \
        --port=8000 \
        --host=0.0.0.0 \
        --served-model-name astanahub/alemllm
```
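
Once the container is up, the vLLM server exposes an OpenAI-compatible API on port 8000. A minimal query sketch, assuming the server is reachable on localhost (adjust the host as needed):

```python
import requests

# vLLM's OpenAI-compatible chat completions endpoint;
# "localhost" is an assumption -- replace with your server's address.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "astanahub/alemllm",  # matches --served-model-name above
    "messages": [
        {"role": "user", "content": "Қазақстан туралы қысқаша айтып бер."}
    ],
    "max_tokens": 512,
}

response = requests.post(url, json=payload, timeout=300)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```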
|
|
|
## Run in Hugging Face mode

Tested environment:

- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model in its native precision and shard it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)

# Use the slow (SentencePiece) tokenizer, matching --tokenizer-mode=slow above.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]

# Render the chat template and append the generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in
    zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
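
To see tokens as they are produced rather than waiting for the full completion, a TextStreamer from transformers can be attached to the same generate call. A minimal variation of the snippet above:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated; skip echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    streamer=streamer,
)
```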
|
|
|
## Run in TuringInfer mode |
|
|
|
Tested environment:

- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0
|
|
|
```bash
python -m turing_serving.launcher \
    --model-path /path/to/alemllm \
    --model-name alemllm \
    --host 0.0.0.0 \
    --port 9528 \
    --solver server_solver \
    --backend vllm \
    --tensor-parallel-size 8 \
    --worker-timeout-seconds 7200 \
    --skip-authorization-check \
    --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu_memory_utilization=0.98
```
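
TuringInfer's request format is not documented here; since the launcher runs the vLLM backend, the sketch below assumes it proxies an OpenAI-compatible chat endpoint on the configured port. Verify the actual route against your TuringInfer deployment before relying on it:

```python
import requests

# Assumed OpenAI-compatible route -- TuringInfer may expose a different path.
url = "http://localhost:9528/v1/chat/completions"

payload = {
    "model": "alemllm",  # matches --model-name above
    "messages": [{"role": "user", "content": "Сәлем! Өзіңді таныстырып өтші."}],
    "max_tokens": 256,
}

print(requests.post(url, json=payload, timeout=300).json())
```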
|
|
|
## License |
|
|
|
The model is released under the CC BY-NC 4.0 license, which does not permit commercial use. For commercial licensing inquiries, please [contact us](https://astanahub.com/ru/contacts/).