	---
	base_model: inceptionai/Llama-3.1-Sherkala-8B-Chat
	language:
	- kk
	- en
	thumbnail: null
	tags:
	- Kazakh
	- English
	- LLM
	- Decoder
	- causal-lm
	- instruction-tuned
	license: cc-by-nc-sa-4.0
	pipeline_tag: text-generation
	---

	# Llama-3.1-Sherkala-8B-Chat

	Llama-3.1-Sherkala-8B-Chat (Sherkala for short) is a state-of-the-art 8 billion parameter instruction-tuned large language model (LLM) designed primarily for Kazakh while maintaining robust performance in English, Russian, and Turkish. Developed by the Institute of Foundation Models (IFM) at MBZUAI in collaboration with Inception (a G42 company) and Cerebras Systems, Sherkala leverages a balanced mixture of multilingual data and a custom tokenizer to overcome the challenges of data scarcity in Kazakh. This model has been optimized for downstream tasks, safe text generation, and cultural alignment.


	## Sherkala Details

	- Developed by: MBZUAI, Inception (a G42 company), Cerebras Systems.
	- Languages: Kazakh (primary), English, Russian, Turkish.
	- Input: Text.
	- Output: Generated text.
	- Model Size: 8B parameters.
	- Context Length: 8,192 tokens.
	- License: cc-by-nc-sa-4.0

## How to Get Started with the Model

The snippet below shows basic usage of the model. It was tested with transformers==4.46.2.

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Repository id taken from this card's metadata.
model_path = "inceptionai/Llama-3.1-Sherkala-8B-Chat"
# Hugging Face access token read from the environment; None is fine for public repos.
TOKEN = os.environ.get("HF_TOKEN")

tokenizer = AutoTokenizer.from_pretrained(model_path, token=TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    model_path, token=TOKEN, torch_dtype=torch.bfloat16, device_map="auto"
)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Llama-3 style chat template: one header block per message, closed by <|eot_id|>.
tokenizer.chat_template="{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role']+'<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %} {% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"


def get_response(text):
    conversation = [
        {"role": "user", "content": text}
    ]

    input_ids = tokenizer.apply_chat_template(
        conversation=conversation,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt").to(device)

    # Generate a response, stopping at the end-of-turn marker
    gen_tokens = model.generate(
        input_ids,
        max_new_tokens=500,
        stop_strings=["<|eot_id|>"],
        tokenizer=tokenizer
    )

    # Decode only the newly generated tokens, dropping the prompt and the final token
    gen_text = tokenizer.decode(gen_tokens[0][len(input_ids[0]): -1])
    return gen_text


# "Can you recommend some good dishes from Kazakhstan?"
question = 'Қазақстанның жақсы тағамдарын ұсына аласыз ба?'
print(get_response(question))
```

	Response:

	> 1. Бешбармақ: Бұл Қазақстанның ең танымал тағамы болуы мүмкін. Бұл ет (әдетте қой, сиыр немесе жылқы) сүйекке дейін пісірілетін және ұсақ сүйектері бар табаққа ұсынылатын дәстүрлі тағам. Етті қолмен жейді, көбінесе сорпамен бірге.
	> 2. Казыбек: Бұл ашытылған сүттен жасалған дәстүрлі қазақ ірімшігі. Ол жұмсақ және сәл қышқыл дәмімен танымал. Оны өздігінен жеуге немесе нанға немесе сорпаға жаюға болады.
	> 3. Бауырсақ: Бұл ет пен пиязбен толтырылған дәмді, дөңгелек нан. Бұл көптеген қазақ тағамдарының негізгі тағамы.
	> 4. Қуырылған тауық: Бұл қуырылған тауық еті, бірақ оны жасау тәсілі ерекше. Тауық етін дәмдеуіштер мен йогурт қоспасында маринадтайды, содан кейін қытырлақ болғанша қуырады.
	> 5. Шашлык: Бұл кәуапқа ұқсайды және бүкіл әлемде танымал. Шашлык әдетте тауық еті, қой еті немесе сиыр еті сияқты ет кесектерінен дайындалады және ашық отта грильде пісіріледі.
	> 6. Борщ: Бұл қырыққабат, сәбіз, картоп және ет қосылған қызылша сорпасы. Бұл суық айларда жиі ұсынылатын дәмді, жылытатын сорпа.
	> 7. Кәстрөл: Бұл ет (әдетте қой немесе сиыр еті), картоп, пияз және басқа да көкөністерден жасалған бұқтырылған тағам. Ол әдетте нанмен бірге беріледі.
	> 8. Жал-жая: Бұл қой етінен, картоптан, пияздан және дәмдеуіштерден жасалған бұқтырылған тағам. Ол әдетте буға пісірілген күрішпен бірге беріледі.
	> 9. Қуырылған кеспе: Бұл кеспе, ет және көкөністерден жасалған сорпа. Ол әдетте нанмен бірге беріледі.
	> 10. Бал шырыны: Бұл бал мен сүттен жасалған тәтті сусын. Бұл сергітетін және пайдалы
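
The `get_response` function above sends a single user turn. For multi-turn chat, the same `apply_chat_template` call takes a longer list of role/content dictionaries. The sketch below only builds that list and does not load the model. `append_turn` is a hypothetical helper, and restricting roles to `user` and `assistant` is an assumption made for illustration.

```python
from typing import Dict, List

Message = Dict[str, str]


def append_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one more turn; consecutive roles must differ."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unsupported role: {role!r}")
    if history and history[-1]["role"] == role:
        raise ValueError("consecutive turns must come from different roles")
    return history + [{"role": role, "content": content.strip()}]


history: List[Message] = []
history = append_turn(history, "user", "Hello!")
history = append_turn(history, "assistant", "Hello! How can I help you?")
history = append_turn(history, "user", "Which city is the capital of Kazakhstan?")

# `history` can be passed as `conversation=history` to
# tokenizer.apply_chat_template(...) in place of the single-turn list.
print([m["role"] for m in history])
```

After each model reply, appending it with the `assistant` role keeps the history ready for the next user turn.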

	## Model Architecture

	Sherkala builds upon the Llama-3.1-8B architecture—a causal, decoder-only transformer model that employs RoPE positional encoding and grouped-query attention. To better capture the rich morphological features of Kazakh, we extend the base vocabulary by 25% with high-frequency Kazakh tokens. This expansion reduces tokenization fertility (i.e., the average number of subwords per word) and improves both training and inference efficiency.
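
Tokenization fertility, as defined above, can be measured with a few lines of code. The sketch below is illustrative only: `char_pairs` and `whole_words` are hypothetical toy tokenizers standing in for a vocabulary that covers Kazakh poorly and one that covers it well, and splitting words on whitespace is an assumption.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


# Hypothetical toy tokenizers, for illustration only.
def char_pairs(text: str) -> List[str]:
    """Split each word into 2-character pieces (a vocabulary with poor coverage)."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def whole_words(text: str) -> List[str]:
    """One token per word (a vocabulary that covers the language well)."""
    return text.split()


# "Kazakhstan is located in Central Asia"
sample = ["Қазақстан Орталық Азияда орналасқан"]
print(fertility(sample, char_pairs))
print(fertility(sample, whole_words))
```

To compare real tokenizers, pass the `tokenize` method of a loaded Hugging Face tokenizer as the second argument. Lower fertility means fewer tokens for the same text, and therefore shorter sequences during training and inference.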


	## Pre-training Data

	Sherkala is continually pre-trained on 45.3 billion tokens from a diverse range of sources:

	- Kazakh: 19.45B tokens
	- English: 19.45B tokens
	- Russian & Turkish: 6.4B tokens

	Data sources include Wikipedia, cleaned CommonCrawl archives, news articles, educational texts, and high-quality synthetic translations. A mixing ratio of 3:1:3 (Kazakh : Russian+Turkish : English) ensures a strong Kazakh foundation while preserving competitive English performance.
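
As a quick arithmetic check, the token counts listed above can be turned into shares and ratios. This only reuses the figures from this section and involves no further assumptions about the data.

```python
# Token counts in billions, as listed in this section.
tokens_b = {"Kazakh": 19.45, "Russian+Turkish": 6.4, "English": 19.45}

total = sum(tokens_b.values())
smallest = min(tokens_b.values())

for name, count in tokens_b.items():
    share = 100 * count / total
    print(f"{name:16s} {count:6.2f}B  {share:5.1f}%  ratio {count / smallest:.2f}")
print(f"total {total:.2f}B")
```

The Kazakh and English portions each come out at roughly three times the combined Russian and Turkish portion, which is where the 3:1:3 ratio comes from.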

	## Instruction Tuning

	To enable robust instruction following and safe dialog generation, Sherkala is fine-tuned on a large-scale multilingual instruction dataset comprising:
	- ~5.9M prompt–response pairs in Kazakh
	- ~2.7M prompt–response pairs in English
	- 263K prompt–response pairs in Russian

	A dedicated safety dataset—created using a mix of direct and adversarial prompts—is incorporated to mitigate harmful or biased outputs and to ensure cultural alignment. More information can be found in the [Sherkala paper](TODO).

	## Evaluation

	Sherkala has been extensively evaluated across downstream tasks, open-ended generation, and safety metrics. The following sections detail the evaluation results.

	### Downstream Evaluation

	#### Evaluation Datasets

	Sherkala is benchmarked on multiple tasks in Kazakh, Russian, and English, including:
	- Knowledge: KazMMLU, MMLU, Belebele, etc.
	- Commonsense Reasoning: HellaSwag (HS), PIQA, BoolQA, SIQA, ARC-Challenge (ARC), OpenBookQA (OBQA), NIS MATH, COPA.
	- Misinformation & Bias: TruthfulQA (T-QA) and CrowS-Pairs.

	#### Kazakh Benchmark Results

	<div class="table-container">

| Model | AVG | KazMMLU | MMLU | Belebele | HS | PIQA | BoolQA | SIQA | ARC | OBQA | NIS | COPA | T-QA | CS-Pairs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLOOM (7.1B) | 37.6 | 29.3 | 27.9 | 29.9 | 52.0 | 62.1 | 36.7 | 23.6 | 33.6 | 26.4 | 22.0 | 47.2 | 49.2 | 49.1 |
| BLOOMZ (7.1B) | 36.9 | 29.2 | 27.8 | 30.4 | 50.8 | 54.4 | 36.8 | 24.4 | 31.0 | 22.1 | 23.0 | 51.8 | 48.1 | 50.1 |
| Gemma-2 (9B) | 35.7 | 26.1 | 27.5 | 28.3 | 51.9 | 62.0 | 33.5 | 23.6 | 28.4 | 26.0 | 17.0 | 45.2 | 47.1 | 47.5 |
| Gemma-2-it (9B) | 36.9 | 31.4 | 28.4 | 27.9 | 51.0 | 63.5 | 36.0 | 24.0 | 30.6 | 23.8 | 22.0 | 48.8 | 49.3 | 42.6 |
| Qwen-2.5 (7B) | 38.5 | 35.1 | 31.3 | 31.2 | 53.4 | 54.8 | 38.0 | 27.1 | 30.2 | 26.3 | 36.0 | 46.0 | 48.0 | 42.6 |
| Qwen-2.5-Instruct (7B) | 40.8 | 37.8 | 33.2 | 31.5 | 52.3 | 60.9 | 38.1 | 27.8 | 31.6 | 31.1 | 38.0 | 47.2 | 51.0 | 49.3 |
| LLama3.1 (8B) | 39.8 | 38.3 | 31.3 | 37.8 | 57.2 | 63.7 | 38.1 | 29.6 | 32.8 | 25.9 | 20.0 | 47.8 | 51.3 | 43.9 |
| LLama3.1-Instruct (8B) | 40.4 | 38.9 | 32.4 | 37.5 | 57.5 | 67.5 | 37.9 | 30.3 | 32.6 | 27.0 | 22.0 | 48.2 | 49.7 | 43.2 |
| LLama3.1-KazLLM-1.0 (8B) | 43.7 | 37.0 | 31.5 | 46.0 | 62.8 | 69.8 | 44.7 | 35.5 | 34.2 | 27.8 | 32.0 | 50.4 | 50.9 | 45.0 |
| Sherkala (Ours) | 45.7 | 51.6 | 37.7 | 53.1 | 68.1 | 66.9 | 42.2 | 38.1 | 37.0 | 25.9 | 18.0 | 51.0 | 50.3 | 54.3 |
| Sherkala-chat (Ours-chat) | 46.9 | 38.8 | 33.9 | 54.5 | 65.3 | 75.7 | 48.0 | 43.6 | 35.6 | 29.0 | 27.0 | 53.0 | 55.7 | 50.2 |

	</div>


	#### English Benchmark Results

	<div class="table-container">

| Model | AVG | MMLU | RACE | HS | PIQA | BoolQA | SIQA | ARC | OBQA | Winogrande | TruthfulQA | CrowS-Pairs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLOOM (7.1B) | 48.5 | 29.1 | 36.5 | 59.6 | 73.6 | 62.2 | 46.5 | 33.4 | 35.8 | 38.9 | 68.9 | 72.6 |
| BLOOMZ (7.1B) | 57.0 | 36.7 | 45.6 | 63.1 | 77.4 | 90.7 | 59.7 | 43.6 | 42.0 | 45.2 | 65.6 | — |
| Gemma-2 (9B) | 39.4 | 27.4 | 27.8 | 33.2 | 59.1 | 62.2 | 37.6 | 24.2 | 26.4 | 46.4 | 49.3 | — |
| Gemma-2-it (9B) | 53.2 | 37.7 | 46.7 | 65.4 | 69.5 | 80.1 | 44.1 | 40.7 | 29.6 | 62.1 | 56.5 | — |
| Qwen-2.5 (7B) | 60.8 | 44.0 | 41.4 | 78.9 | 79.9 | 84.5 | 51.9 | 51.4 | 47.2 | 56.4 | 71.9 | — |
| Qwen-2.5-Instruct (7B) | 62.1 | 46.7 | 46.3 | 80.5 | 80.3 | 86.4 | 48.7 | 54.9 | 48.8 | 64.8 | 63.2 | — |
| LLama3.1 (8B) | 56.6 | 39.6 | 38.9 | 79.0 | 81.3 | 65.3 | 52.6 | 53.5 | 45.0 | 45.2 | 65.5 | — |
| LLama3.1-Instruct (8B) | 60.1 | 41.7 | 44.9 | 79.2 | 81.0 | 79.4 | 52.7 | 55.0 | 43.6 | 54.0 | 69.0 | — |
| LLama3.1-KazLLM-1.0 (8B) | 58.6 | 39.7 | 44.3 | 77.9 | 80.8 | 72.8 | 51.5 | 54.6 | 43.0 | 51.0 | 70.0 | — |
| Sherkala (Ours) | 58.7 | 46.8 | 39.2 | 78.3 | 80.5 | 77.2 | 51.3 | 52.1 | 46.0 | 49.6 | 65.9 | — |
| Sherkala-chat (Ours-chat) | 58.6 | 39.0 | 41.5 | 76.2 | 79.0 | 82.7 | 56.8 | 51.1 | 41.6 | 56.4 | 62.0 | — |

	</div>

Evaluation results on Kazakh and English benchmarks. AVG is the mean score across tasks, and higher is better for every metric. “HS”, “ARC”, “OBQA”, “NIS”, “T-QA”, and “CS-Pairs” denote HellaSwag, ARC-Challenge (Easy), OpenBookQA, NIS MATH, TruthfulQA, and CrowS-Pairs. Further details on the evaluation, including additional results in Russian, can be found in the [Sherkala paper](TODO).
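
To make the AVG column concrete, the snippet below recomputes it for one row as the unweighted mean of the per-task scores. The values are copied from the Sherkala-chat row of the Kazakh table. The variable names are illustrative.

```python
from statistics import mean

# Sherkala-chat row of the Kazakh benchmark table, in column order.
sherkala_chat_kk = {
    "KazMMLU": 38.8, "MMLU": 33.9, "Belebele": 54.5, "HS": 65.3,
    "PIQA": 75.7, "BoolQA": 48.0, "SIQA": 43.6, "ARC": 35.6,
    "OBQA": 29.0, "NIS": 27.0, "COPA": 53.0, "T-QA": 55.7, "CS-Pairs": 50.2,
}

# Unweighted mean over the 13 tasks.
avg = mean(list(sherkala_chat_kk.values()))
print(f"{avg:.1f}")  # prints 46.9, the AVG reported for this row
```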


	### Generation Evaluation

	We further evaluated open-ended text generation using GPT-4 as a judge. The following table shows average generation scores (with standard deviations) for models on the MT and Vicuna benchmarks across Kazakh, Russian, and English:

	<div class="table-container">

| Model | Kazakh MT (avg ± sd) | Kazakh Vicuna (avg ± sd) | Russian MT (avg ± sd) | Russian Vicuna (avg ± sd) | English MT (avg ± sd) | English Vicuna (avg ± sd) |
|---|---|---|---|---|---|---|
| GPT-4o | 8.81 ± 1.51 | 9.32 ± 0.61 | 8.89 ± 1.59 | 9.79 ± 0.41 | 8.36 ± 1.35 | 9.03 ± 0.59 |
| Qwen-2.5-7B-Instruct | 3.52 ± 3.52 | 3.23 ± 1.73 | 5.81 ± 2.36 | 6.05 ± 3.07 | 7.40 ± 1.85 | 8.06 ± 1.22 |
| Llama-3.1-8B-Instruct | 3.76 ± 2.11 | 3.75 ± 1.91 | 0.85 ± 1.20 | 0.82 ± 1.55 | 6.55 ± 2.03 | 7.41 ± 1.28 |
| KazLLM-1.0-8B | 3.98 ± 2.15 | 4.88 ± 2.01 | 0.72 ± 1.06 | 0.28 ± 0.71 | 6.00 ± 2.15 | 6.66 ± 1.24 |
| Sherkala-chat | 5.99 ± 2.73 | 7.39 ± 1.89 | 1.02 ± 1.41 | 0.97 ± 1.70 | 5.78 ± 2.43 | 6.55 ± 1.59 |

	</div>
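
Each cell in the table is a mean and a standard deviation over per-prompt judge scores. A minimal sketch of that aggregation follows. The scores are hypothetical, not taken from the evaluation, and using the sample standard deviation rather than the population one is an assumption.

```python
from statistics import mean, stdev

# Hypothetical per-prompt judge scores for one model on one benchmark.
scores = [7, 9, 4, 8, 6, 10, 5, 7]

avg = mean(scores)
sd = stdev(scores)  # sample standard deviation (assumption)
print(f"{avg:.2f} ± {sd:.2f}")
```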



	## Intended Use

	Sherkala is intended for both research and commercial applications in Kazakh NLP, including:
	- Chat Assistants: Conversational agents tailored for Kazakh speakers.
	- Question Answering & Content Generation: Systems that deliver culturally aligned, factual, and contextually rich responses.
	- Multilingual NLP: Applications that support English, Russian, and Turkish alongside Kazakh.

	We believe that a number of audiences will benefit from our model:

	* Academics: Those researching Kazakh natural language processing.
	* Businesses: Companies targeting Kazakh-speaking audiences.
	* Developers: Those integrating Kazakh language capabilities in apps.

	### Out-of-Scope Use

While Sherkala is a powerful language model catering to Kazakh and English, it is essential to understand its limitations and the potential for its misuse.

Sherkala is not recommended for:
* Malicious Use: The model should not be used for generating harmful, misleading, or inappropriate content. This includes but is not limited to:
  * Generating or promoting hate speech, violence, or discrimination.
  * Spreading misinformation or fake news.
  * Engaging in illegal activities or promoting them.
* Handling Sensitive Information: The model should not be used to handle or to generate personal, confidential, or sensitive information.
* Generalization Across All Languages: Sherkala is optimized only for Kazakh and English. It should not be assumed to have equal proficiency in other languages or dialects.
* High-Stakes Decisions: The model should not be used for making high-stakes decisions without human oversight. This includes medical, legal, financial, or safety-critical decisions, among others.


	## Bias, Risks, and Limitations

	Although extensive measures have been taken to mitigate biases and ensure safe outputs, Sherkala—like all large language models—may still produce inaccurate, misleading, or biased content. Users should apply additional safety measures and conduct thorough evaluations when deploying the model in sensitive or high-stakes environments.



	### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.


	## Training Details

	### Training Data

	Sherkala was trained on 45.3 billion tokens consisting of:

	* 19.45B Kazakh tokens
	* 19.45B English tokens
	* 6.4B Russian and Turkish tokens

	The dataset includes:

	* Wikipedia, CommonCrawl, news articles, Hugging Face datasets
	* Publicly available documents, educational content, and code
	* Machine-translated Kazakh data from books and Wikipedia
	* Data extracted via Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR)

	### Training Procedure

Sherkala is adapted from Llama-3.1-8B through continual pre-training.

#### Preprocessing

* Language-specific standardization, filtering, cleaning, and deduplication.
* Fuzzy deduplication with locality-sensitive hashing (LSH), which reduced the dataset to 41.1% of its raw volume.
* Vocabulary extension: the base tokenizer vocabulary was extended by 25% with high-frequency Kazakh tokens, so Kazakh words split into fewer subwords.
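
The card does not describe how the fuzzy deduplication was implemented. As a rough illustration of the general idea, the sketch below uses MinHash signatures with LSH banding, which is one common way to do it. The shingle size, the number of hash functions, the band layout, and all function names are assumptions and not the authors' pipeline.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams of the lower-cased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(item: str, seed: int) -> int:
    """64-bit hash of `item`, salted with `seed` to act as one hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash(items: Set[str], num_perm: int) -> List[int]:
    """One minimum per hash function; similar sets agree on many positions."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_perm)]


def near_duplicate_pairs(
    docs: List[str], bands: int = 16, rows: int = 4
) -> Set[Tuple[int, int]]:
    """Candidate pairs: documents whose signatures agree on at least one band."""
    sigs = [minhash(shingles(d), bands * rows) for d in docs]
    pairs: Set[Tuple[int, int]] = set()
    for b in range(bands):
        buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
        for idx, sig in enumerate(sigs):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(idx)
        for ids in buckets.values():
            pairs.update((i, j) for n, i in enumerate(ids) for j in ids[n + 1:])
    return pairs


docs = [
    "Almaty is the largest city in Kazakhstan.",
    "Almaty is the largest city in Kazakhstan!",
    "The Caspian Sea borders Kazakhstan to the west.",
]
print(near_duplicate_pairs(docs))
```

Banding trades recall for precision: more bands with fewer rows catch looser matches, while fewer bands with more rows keep only very close duplicates. A real pipeline would typically verify candidate pairs with an exact similarity check before dropping documents.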


	#### Training Hyperparameters

	* Learning rate: 1.5e-4
	* Batch size: 4 million tokens
	* Optimizer: AdamW (β1 = 0.9, β2 = 0.95, ε = 1e-5)
	* Weight decay: 0.1
	* Gradient norm clipping: 1.0
* Learning rate schedule:
  * Linear warm-up (110 steps)
  * 10× cosine decay until 11,433 steps
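
The schedule above can be sketched as a function of the training step. This reading makes two assumptions that the card does not spell out: "10× cosine decay" is taken to mean decaying to one tenth of the peak learning rate, and warm-up is taken to rise linearly from near zero. The constant and function names are illustrative.

```python
import math

PEAK_LR = 1.5e-4
WARMUP_STEPS = 110
TOTAL_STEPS = 11_433
FINAL_LR = PEAK_LR / 10  # assumption: decay ends at one tenth of the peak


def learning_rate(step: int) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay to FINAL_LR at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for step in (0, 109, 110, 5_000, 11_433):
    print(step, f"{learning_rate(step):.3e}")
```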

#### Speeds, Sizes, Times

	* Training infrastructure: Cerebras Condor Galaxy 2 (CG-2) AI supercomputer
	* Training executed on 16 Cerebras CS-2 systems
	* Context length: 8,192 tokens
	* Parallelism: Pure data parallelism across multiple CS-2 systems.


	#### Summary

	[TODO]

	#### Citation info

```bibtex
@misc{sengupta2023jais,
  title={Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models},
  author={Neha Sengupta and Sunil Kumar Sahu and Bokang Jia and Satheesh Katipomu and Haonan Li and Fajri Koto and William Marshall and Gurpreet Gosal and Cynthia Liu and Zhiming Chen and Osama Mohammed Afzal and Samta Kamboj and Onkar Pandit and Rahul Pal and Lalit Pradhan and Zain Muhammad Mujahid and Massa Baali and Xudong Han and Sondos Mahmoud Bsharat and Alham Fikri Aji and Zhiqiang Shen and Zhengzhong Liu and Natalia Vassilieva and Joel Hestness and Andy Hock and Andrew Feldman and Jonathan Lee and Andrew Jackson and Hector Xuguang Ren and Preslav Nakov and Timothy Baldwin and Eric Xing},
  year={2023},
  eprint={2308.16149},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@article{jaisfamilymodelcard,
  title={Jais Family Model Card},
  author={Inception},
  year={2024},
  url={https://huggingface.co/inceptionai/jais-family-30b-16k-chat/blob/main/README.md}
}
```