---
base_model: inceptionai/Llama-3.1-Sherkala-8B-Chat
language:
- kk
- en
thumbnail: null
tags:
- Kazakh
- English
- LLM
- Decoder
- causal-lm
- instruction-tuned
license: cc-by-nc-sa-4.0
pipeline_tag: text-generation
---
Llama-3.1-Sherkala-8B-Chat
Llama-3.1-Sherkala-8B-Chat (Sherkala for short) is a state-of-the-art, instruction-tuned large language model (LLM) with 8 billion parameters, designed primarily for Kazakh while maintaining robust performance in English, Russian, and Turkish. Developed by the Institute of Foundation Models (IFM) at MBZUAI in collaboration with Inception (a G42 company) and Cerebras Systems, Sherkala leverages a balanced mixture of multilingual data and a custom tokenizer to overcome the challenges of data scarcity in Kazakh. The model is optimized for downstream tasks, safe text generation, and cultural alignment.
Sherkala Details
- Developed by: MBZUAI, Inception (a G42 company), Cerebras Systems.
- Languages: Kazakh (primary), English, Russian, Turkish.
- Input: Text.
- Output: Generated text.
- Model Size: 8B parameters.
- Context Length: 8,192 tokens.
- License: cc-by-nc-sa-4.0
How to Get Started with the Model:
Below is sample code for using the model; it has been tested with transformers==4.46.2.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/Llama-3.1-Sherkala-8B-Chat"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pass token=... to the two from_pretrained calls below if the repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

tokenizer.chat_template="{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role']+'<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %} {% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

def get_response(text):
    conversation = [
        {"role": "user", "content": text}
    ]
    # Build the prompt with the chat template and move it to the model's device
    input_ids = tokenizer.apply_chat_template(
        conversation=conversation,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt").to(device)

    # Generate a response
    gen_tokens = model.generate(
        input_ids,
        max_new_tokens=500,
        stop_strings=["<|eot_id|>"],
        tokenizer=tokenizer,
    )

    # Decode only the newly generated tokens, dropping the trailing <|eot_id|>
    gen_text = tokenizer.decode(gen_tokens[0][len(input_ids[0]):-1])
    return gen_text

question = 'Қазақстанның жақсы тағамдарын ұсына аласыз ба?'  # "Can you recommend some good Kazakh dishes?"
print(get_response(question))
Response (translated from Kazakh):
- Бешбармақ: This is perhaps Kazakhstan's most famous dish. It is a traditional dish in which meat (usually lamb, beef, or horse) is cooked to the bone and served on a platter with the small bones. The meat is eaten by hand, often together with broth.
- Казыбек: This is a traditional Kazakh cheese made from fermented milk. It is known for its soft texture and slightly sour taste. It can be eaten on its own, spread on bread, or added to soup.
- Бауырсақ: This is a savory, round bread filled with meat and onions. It is a staple of many Kazakh meals.
- Қуырылған тауық: This is fried chicken, but prepared in a distinctive way. The chicken is marinated in a mixture of spices and yogurt and then fried until crispy.
- Шашлык: This is similar to kebab and is popular all over the world. Shashlik is usually made from pieces of meat such as chicken, lamb, or beef and grilled over an open fire.
- Борщ: This is a beet soup with cabbage, carrots, potatoes, and meat. It is a tasty, warming soup often served in the colder months.
- Кәстрөл: This is a stew made from meat (usually lamb or beef), potatoes, onions, and other vegetables. It is usually served with bread.
- Жал-жая: This is a stew made from lamb, potatoes, onions, and spices. It is usually served with steamed rice.
- Қуырылған кеспе: This is a soup made from noodles, meat, and vegetables. It is usually served with bread.
- Бал шырыны: This is a sweet drink made from honey and milk. It is refreshing and healthy
Model Architecture
Sherkala builds upon the Llama-3.1-8B architecture, a causal, decoder-only transformer that employs RoPE positional encoding and grouped-query attention. To better capture the rich morphology of Kazakh, we extend the base vocabulary by 25% with high-frequency Kazakh tokens. This expansion reduces tokenization fertility (the average number of subword tokens per word) and improves both training and inference efficiency.
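The benefit of the extended vocabulary can be checked empirically by measuring fertility. Below is a minimal sketch; the sample sentence is illustrative, and comparing against the gated meta-llama/Llama-3.1-8B tokenizer is an assumption about access, not part of the reported evaluation.

from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    # Average number of subword tokens per whitespace-separated word.
    return len(tokenizer.tokenize(text)) / len(text.split())

# Illustrative sample; a representative corpus slice gives a better estimate.
sample = "Қазақстанның жақсы тағамдарын ұсына аласыз ба?"

for name in ["meta-llama/Llama-3.1-8B", "inceptionai/Llama-3.1-Sherkala-8B-Chat"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, round(fertility(tok, sample), 2))

Lower fertility on Kazakh text means fewer tokens per word, which directly improves effective context length and throughput.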
Pre-training Data
Sherkala is continually pre-trained on 45.3 billion tokens from a diverse range of sources:
- Kazakh: 19.45B tokens
- English: 19.45B tokens
- Russian & Turkish: 6.4B tokens
Data sources include Wikipedia, cleaned CommonCrawl archives, news articles, educational texts, and high-quality synthetic translations. A mixing ratio of 3:1:3 (Kazakh : Russian+Turkish : English) ensures a strong Kazakh foundation while preserving competitive English performance.
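As a quick arithmetic check, the per-language budgets above reproduce both the 45.3B total and the 3:1:3 mixing ratio:

# Token budgets in billions, as reported above
kk, en, ru_tr = 19.45, 19.45, 6.4
print(kk + en + ru_tr)         # 45.3, the total pre-training budget
print(kk / ru_tr, en / ru_tr)  # ~3.0 each, i.e. roughly 3 : 1 : 3 (kk : ru+tr : en)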
Instruction Tuning
To enable robust instruction following and safe dialog generation, Sherkala is fine-tuned on a large-scale multilingual instruction dataset comprising:
- ~5.9M prompt–response pairs in Kazakh
- ~2.7M prompt–response pairs in English
- 263K prompt–response pairs in Russian
A dedicated safety dataset, created using a mix of direct and adversarial prompts, is incorporated to mitigate harmful or biased outputs and to ensure cultural alignment. More information can be found in the Sherkala paper.
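Pairs of this form are typically serialized with the chat template used at inference time before fine-tuning. Below is a minimal sketch of that formatting step; the prompt-response pair is invented for illustration and is not drawn from the training set.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("inceptionai/Llama-3.1-Sherkala-8B-Chat")

# Hypothetical pair (English gloss: "What is the capital of Kazakhstan?" /
# "The capital of Kazakhstan is Astana.")
conversation = [
    {"role": "user", "content": "Қазақстанның астанасы қандай?"},
    {"role": "assistant", "content": "Қазақстанның астанасы - Астана қаласы."},
]

# Render the pair into a single training string; during supervised fine-tuning
# the loss is typically masked on the prompt tokens so only the response is learned.
text = tokenizer.apply_chat_template(conversation, tokenize=False)
print(text)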
Evaluation
Sherkala has been extensively evaluated across downstream tasks, open-ended generation, and safety metrics. The following sections detail the evaluation results.
Downstream Evaluation
Evaluation Datasets
Sherkala is benchmarked on multiple tasks in Kazakh, Russian, and English, including:
- Knowledge: KazMMLU, MMLU, Belebele, etc.
- Commonsense Reasoning: HellaSwag (HS), PIQA, BoolQA, SIQA, ARC-Challenge (ARC), OpenBookQA (OBQA), NIS MATH, COPA.
- Misinformation & Bias: TruthfulQA (T-QA) and CrowS-Pairs.
Kazakh Benchmark Results
Model | AVG | KazMMLU | MMLU | Belebele | HS | PIQA | BoolQA | SIQA | ARC | OBQA | NIS | COPA | T-QA | CS-Pairs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLOOM (7.1B) | 37.6 | 29.3 | 27.9 | 29.9 | 52.0 | 62.1 | 36.7 | 23.6 | 33.6 | 26.4 | 22.0 | 47.2 | 49.2 | 49.1 |
BLOOMZ (7.1B) | 36.9 | 29.2 | 27.8 | 30.4 | 50.8 | 54.4 | 36.8 | 24.4 | 31.0 | 22.1 | 23.0 | 51.8 | 48.1 | 50.1 |
Gemma-2 (9B) | 35.7 | 26.1 | 27.5 | 28.3 | 51.9 | 62.0 | 33.5 | 23.6 | 28.4 | 26.0 | 17.0 | 45.2 | 47.1 | 47.5 |
Gemma-2-it (9B) | 36.9 | 31.4 | 28.4 | 27.9 | 51.0 | 63.5 | 36.0 | 24.0 | 30.6 | 23.8 | 22.0 | 48.8 | 49.3 | 42.6 |
Qwen-2.5 (7B) | 38.5 | 35.1 | 31.3 | 31.2 | 53.4 | 54.8 | 38.0 | 27.1 | 30.2 | 26.3 | 36.0 | 46.0 | 48.0 | 42.6 |
Qwen-2.5-Instruct (7B) | 40.8 | 37.8 | 33.2 | 31.5 | 52.3 | 60.9 | 38.1 | 27.8 | 31.6 | 31.1 | 38.0 | 47.2 | 51.0 | 49.3 |
Llama-3.1 (8B) | 39.8 | 38.3 | 31.3 | 37.8 | 57.2 | 63.7 | 38.1 | 29.6 | 32.8 | 25.9 | 20.0 | 47.8 | 51.3 | 43.9 |
Llama-3.1-Instruct (8B) | 40.4 | 38.9 | 32.4 | 37.5 | 57.5 | 67.5 | 37.9 | 30.3 | 32.6 | 27.0 | 22.0 | 48.2 | 49.7 | 43.2 |
Llama-3.1-KazLLM-1.0 (8B) | 43.7 | 37.0 | 31.5 | 46.0 | 62.8 | 69.8 | 44.7 | 35.5 | 34.2 | 27.8 | 32.0 | 50.4 | 50.9 | 45.0 |
Sherkala (Ours) | 45.7 | 51.6 | 37.7 | 53.1 | 68.1 | 66.9 | 42.2 | 38.1 | 37.0 | 25.9 | 18.0 | 51.0 | 50.3 | 54.3 |
Sherkala-chat (Ours-chat) | 46.9 | 38.8 | 33.9 | 54.5 | 65.3 | 75.7 | 48.0 | 43.6 | 35.6 | 29.0 | 27.0 | 53.0 | 55.7 | 50.2 |
English Benchmark Results
Model | AVG | MMLU | RACE | HS | PIQA | BoolQA | SIQA | ARC | OBQA | Winogrande | TruthfulQA | CrowS-Pairs |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BLOOM (7.1B) | 48.5 | 29.1 | 36.5 | 59.6 | 73.6 | 62.2 | 46.5 | 33.4 | 35.8 | 38.9 | 68.9 | 72.6 |
BLOOMZ (7.1B) | 57.0 | 36.7 | 45.6 | 63.1 | 77.4 | 90.7 | 59.7 | 43.6 | 42.0 | 45.2 | 65.6 | — |
Gemma-2 (9B) | 39.4 | 27.4 | 27.8 | 33.2 | 59.1 | 62.2 | 37.6 | 24.2 | 26.4 | 46.4 | 49.3 | — |
Gemma-2-it (9B) | 53.2 | 37.7 | 46.7 | 65.4 | 69.5 | 80.1 | 44.1 | 40.7 | 29.6 | 62.1 | 56.5 | — |
Qwen-2.5 (7B) | 60.8 | 44.0 | 41.4 | 78.9 | 79.9 | 84.5 | 51.9 | 51.4 | 47.2 | 56.4 | 71.9 | — |
Qwen-2.5-Instruct (7B) | 62.1 | 46.7 | 46.3 | 80.5 | 80.3 | 86.4 | 48.7 | 54.9 | 48.8 | 64.8 | 63.2 | — |
Llama-3.1 (8B) | 56.6 | 39.6 | 38.9 | 79.0 | 81.3 | 65.3 | 52.6 | 53.5 | 45.0 | 45.2 | 65.5 | — |
Llama-3.1-Instruct (8B) | 60.1 | 41.7 | 44.9 | 79.2 | 81.0 | 79.4 | 52.7 | 55.0 | 43.6 | 54.0 | 69.0 | — |
Llama-3.1-KazLLM-1.0 (8B) | 58.6 | 39.7 | 44.3 | 77.9 | 80.8 | 72.8 | 51.5 | 54.6 | 43.0 | 51.0 | 70.0 | — |
Sherkala (Ours) | 58.7 | 46.8 | 39.2 | 78.3 | 80.5 | 77.2 | 51.3 | 52.1 | 46.0 | 49.6 | 65.9 | — |
Sherkala-chat (Ours-chat) | 58.6 | 39.0 | 41.5 | 76.2 | 79.0 | 82.7 | 56.8 | 51.1 | 41.6 | 56.4 | 62.0 | — |
Evaluation results on Kazakh and English language benchmarks. Average represents the mean score across tasks. Higher scores are better across all metrics. “HS”, “ARC”, “OBQA”, “T-QA” denote HellaSwag, ARC-Challenge (Easy), OpenBookQA, and TruthfulQA. Further details on the evaluation, including additional results in Russian, can be found in the Sherkala paper.
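The downstream benchmarks above are multiple-choice tasks, which are typically scored by selecting the candidate answer to which the model assigns the highest log-likelihood. Below is a minimal sketch of that scoring scheme, under the assumption that the harness works this way; the paper's exact evaluation setup may differ.

import torch

@torch.no_grad()
def choice_logprob(model, tokenizer, prompt: str, choice: str) -> float:
    # Sum of log-probabilities the model assigns to the tokens of `choice` given
    # `prompt`. Assumes tokenizing `prompt` yields a prefix of tokenizing
    # `prompt + choice`, which holds for typical BPE tokenizers when `choice`
    # begins with a space.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    logprobs = model(full_ids).logits[0, :-1].log_softmax(-1)
    choice_ids = full_ids[0, prompt_len:]
    positions = torch.arange(prompt_len - 1, full_ids.shape[1] - 1, device=full_ids.device)
    return logprobs[positions, choice_ids].sum().item()

def predict(model, tokenizer, question: str, choices: list[str]) -> str:
    scores = [choice_logprob(model, tokenizer, question, " " + c) for c in choices]
    return max(zip(scores, choices))[1]

Accuracy is then the fraction of questions for which the highest-likelihood choice is the gold answer.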
Generation Evaluation
We further evaluated open-ended text generation using GPT-4 as a judge. The following table shows average generation scores (with standard deviations) for models on the MT and Vicuna benchmarks across Kazakh, Russian, and English:
Model | Kazakh MT (avg ± sd) | Kazakh Vicuna (avg ± sd) | Russian MT (avg ± sd) | Russian Vicuna (avg ± sd) | English MT (avg ± sd) | English Vicuna (avg ± sd) |
---|---|---|---|---|---|---|
GPT-4o | 8.81 ± 1.51 | 9.32 ± 0.61 | 8.89 ± 1.59 | 9.79 ± 0.41 | 8.36 ± 1.35 | 9.03 ± 0.59 |
Qwen-2.5-7B-Instruct | 3.52 ± 3.52 | 3.23 ± 1.73 | 5.81 ± 2.36 | 6.05 ± 3.07 | 7.40 ± 1.85 | 8.06 ± 1.22 |
Llama-3.1-8B-Instruct | 3.76 ± 2.11 | 3.75 ± 1.91 | 0.85 ± 1.20 | 0.82 ± 1.55 | 6.55 ± 2.03 | 7.41 ± 1.28 |
KazLLM-1.0-8B | 3.98 ± 2.15 | 4.88 ± 2.01 | 0.72 ± 1.06 | 0.28 ± 0.71 | 6.00 ± 2.15 | 6.66 ± 1.24 |
Sherkala-chat | 5.99 ± 2.73 | 7.39 ± 1.89 | 1.02 ± 1.41 | 0.97 ± 1.70 | 5.78 ± 2.43 | 6.55 ± 1.59 |
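Below is a minimal sketch of such a judging loop, using the openai client library; the rubric prompt is invented for illustration and is not the judge prompt used in the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; the actual judge prompt used for these scores differs.
JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the question below on a 0-10 scale for "
    "helpfulness, relevance, accuracy, and fluency. Reply with a single number.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def judge(question: str, answer: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(question=question, answer=answer)}],
    )
    return float(response.choices[0].message.content.strip())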
Intended Use
Sherkala is intended for both research and commercial applications in Kazakh NLP, including:
- Chat Assistants: Conversational agents tailored for Kazakh speakers.
- Question Answering & Content Generation: Systems that deliver culturally aligned, factual, and contextually rich responses.
- Multilingual NLP: Applications that support English, Russian, and Turkish alongside Kazakh.
We believe that a number of audiences will benefit from our model:
- Academics: Those researching Kazakh natural language processing.
- Businesses: Companies targeting Kazakh-speaking audiences.
- Developers: Those integrating Kazakh language capabilities in apps.
Out-of-Scope Use
While Sherkala is a powerful language model catering to Kazakh and English, it is essential to understand its limitations and the potential for misuse.
Sherkala is not recommended for:
- Malicious Use: The model should not be used to generate harmful, misleading, or inappropriate content. This includes, but is not limited to:
  - Generating or promoting hate speech, violence, or discrimination
  - Spreading misinformation or fake news
  - Engaging in illegal activities or promoting them
- Handling Sensitive Information: The model should not be used to handle or generate personal, confidential, or otherwise sensitive information.
- Generalization Across All Languages: Sherkala is optimized only for Kazakh and English. It should not be assumed to have equal proficiency in other languages or dialects.
- High-Stakes Decisions: The model should not be used for making high-stakes decisions without human oversight. This includes medical, legal, financial, or safety-critical decisions, among others.
Bias, Risks, and Limitations
Although extensive measures have been taken to mitigate biases and ensure safe outputs, Sherkala—like all large language models—may still produce inaccurate, misleading, or biased content. Users should apply additional safety measures and conduct thorough evaluations when deploying the model in sensitive or high-stakes environments.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
Training Details
Training Data
Sherkala was trained on 45.3 billion tokens consisting of:
- 19.45B Kazakh tokens
- 19.45B English tokens
- 6.4B Russian and Turkish tokens
The dataset includes:
- Wikipedia, CommonCrawl, news articles, Hugging Face datasets
- Publicly available documents, educational content, and code
- Machine-translated Kazakh data from books and Wikipedia
- Data extracted via Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR)
Training Procedure
Sherkala is adapted from Llama-3.1-8B, trained in a continual pretraining setup.
Preprocessing
- Language-specific standardization, filtering, cleaning, and deduplication.
- Applied fuzzy deduplication using locality-sensitive hashing (LSH), reducing the dataset to 41.1% of its raw volume (a sketch follows this list).
- Kazakh tokenizer extended by 25%, reducing tokenization inefficiencies and improving Kazakh language processing.
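A minimal sketch of MinHash-LSH fuzzy deduplication using the datasketch library; the shingle size, permutation count, and similarity threshold here are illustrative assumptions, not the settings used for Sherkala.

from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of MinHash permutations (illustrative)

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    # Character 5-gram shingles; the shingle size is an illustrative choice.
    for i in range(max(1, len(text) - 4)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if not lsh.query(m):       # keep only if no near-duplicate is already indexed
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept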
Training Hyperparameters
- Learning rate: 1.5e-4
- Batch size: 4 million tokens
- Optimizer: AdamW (β1 = 0.9, β2 = 0.95, ε = 1e-5)
- Weight decay: 0.1
- Gradient norm clipping: 1.0
- Learning rate schedule:
  - Linear warm-up over 110 steps
  - 10× cosine decay (to one-tenth of the peak learning rate) until step 11,433; see the sketch below
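Reading the 10× cosine decay as a decay from the peak learning rate to one-tenth of it, the schedule can be sketched as follows. The step counts are the ones listed above; note that 11,433 steps at ~4M tokens per step roughly matches the 45.3B-token budget.

import math

MAX_LR = 1.5e-4
MIN_LR = MAX_LR / 10   # the "10x" decay target, as read above
WARMUP_STEPS = 110
TOTAL_STEPS = 11_433

def learning_rate(step: int) -> float:
    # Linear warm-up, then cosine decay from MAX_LR down to MIN_LR.
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))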
Speeds, Sizes, Times
- Training infrastructure: Cerebras Condor Galaxy 2 (CG-2) AI supercomputer
- Training executed on 16 Cerebras CS-2 systems
- Context length: 8,192 tokens
- Parallelism: Pure data parallelism across multiple CS-2 systems.