---
base_model: inceptionai/Llama-3.1-Sherkala-8B-Chat
language:
- kk
- en
thumbnail: null
tags:
- Kazakh
- English
- LLM
- Decoder
- causal-lm
- instruction-tuned
license: cc-by-nc-sa-4.0
pipeline_tag: text-generation
---
Llama-3.1-Sherkala-8B-Chat
Llama-3.1-Sherkala-8B-Chat (Sherkala for short) is a state-of-the-art, instruction-tuned large language model (LLM) with 8 billion parameters, designed primarily for Kazakh while maintaining robust performance in English, Russian, and Turkish. Developed by the Institute of Foundation Models (IFM) at MBZUAI in collaboration with Inception (a G42 company) and Cerebras Systems, Sherkala leverages a balanced mixture of multilingual data and a custom tokenizer to overcome the challenges of data scarcity in Kazakh. The model is optimized for downstream tasks, safe text generation, and cultural alignment.
Sherkala Details
- Developed by: MBZUAI, Inception (a G42 company), Cerebras Systems.
- Languages: Kazakh (primary), English, Russian, Turkish.
- Input: Text.
- Output: Generated text.
- Model Size: 8B parameters.
- Context Length: 8,192 tokens.
- License: cc-by-nc-sa-4.0
How to Get Started with the Model:
Below is sample code for using the model; it has been tested with transformers==4.46.2.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/Llama-3.1-Sherkala-8B-Chat"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pass token=... to the two from_pretrained calls below if the repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

tokenizer.chat_template="{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role']+'<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %} {% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

def get_response(text):
    conversation = [
        {"role": "user", "content": text}
    ]
    # Build the prompt with the chat template and move it to the model's device
    input_ids = tokenizer.apply_chat_template(
        conversation=conversation,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt").to(device)

    # Generate a response
    gen_tokens = model.generate(
        input_ids,
        max_new_tokens=500,
        stop_strings=["<|eot_id|>"],
        tokenizer=tokenizer,
    )

    # Decode only the newly generated tokens, dropping the trailing <|eot_id|>
    gen_text = tokenizer.decode(gen_tokens[0][len(input_ids[0]):-1])
    return gen_text

question = 'Қазақстанның жақсы тағамдарын ұсына аласыз ба?'  # "Can you recommend some good Kazakh dishes?"
print(get_response(question))
Response (translated from Kazakh):
- Бешбармақ: This is perhaps Kazakhstan's most famous dish. It is a traditional dish in which meat (usually lamb, beef, or horse) is cooked to the bone and served on a platter with the small bones. The meat is eaten by hand, often together with broth.
- Казыбек: This is a traditional Kazakh cheese made from fermented milk. It is known for its soft texture and slightly sour taste. It can be eaten on its own, spread on bread, or added to soup.
- Бауырсақ: This is a savory, round bread filled with meat and onions. It is a staple of many Kazakh meals.
- Қуырылған тауық: This is fried chicken, but prepared in a distinctive way. The chicken is marinated in a mixture of spices and yogurt and then fried until crispy.
- Шашлык: This is similar to kebab and is popular all over the world. Shashlik is usually made from pieces of meat such as chicken, lamb, or beef and grilled over an open fire.
- Борщ: This is a beet soup with cabbage, carrots, potatoes, and meat. It is a tasty, warming soup often served in the colder months.
- Кәстрөл: This is a stew made from meat (usually lamb or beef), potatoes, onions, and other vegetables. It is usually served with bread.
- Жал-жая: This is a stew made from lamb, potatoes, onions, and spices. It is usually served with steamed rice.
- Қуырылған кеспе: This is a soup made from noodles, meat, and vegetables. It is usually served with bread.
- Бал шырыны: This is a sweet drink made from honey and milk. It is refreshing and healthy
Model Architecture
Sherkala builds upon the Llama-3.1-8B architecture, a causal, decoder-only transformer that employs RoPE positional encoding and grouped-query attention. To better capture the rich morphology of Kazakh, we extend the base vocabulary by 25% with high-frequency Kazakh tokens. This expansion reduces tokenization fertility (the average number of subword tokens per word) and improves both training and inference efficiency.
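The benefit of the extended vocabulary can be checked empirically by measuring fertility. Below is a minimal sketch; the sample sentence is illustrative, and comparing against the gated meta-llama/Llama-3.1-8B tokenizer is an assumption about access, not part of the reported evaluation.

from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    # Average number of subword tokens per whitespace-separated word.
    return len(tokenizer.tokenize(text)) / len(text.split())

# Illustrative sample; a representative corpus slice gives a better estimate.
sample = "Қазақстанның жақсы тағамдарын ұсына аласыз ба?"

for name in ["meta-llama/Llama-3.1-8B", "inceptionai/Llama-3.1-Sherkala-8B-Chat"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, round(fertility(tok, sample), 2))

Lower fertility on Kazakh text means fewer tokens per word, which directly improves effective context length and throughput.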
Pre-training Data
Sherkala is continually pre-trained on 45.3 billion tokens from a diverse range of sources:
- Kazakh: 19.45B tokens
- English: 19.45B tokens
- Russian & Turkish: 6.4B tokens
Data sources include Wikipedia, cleaned CommonCrawl archives, news articles, educational texts, and high-quality synthetic translations. A mixing ratio of 3:1:3 (Kazakh : Russian+Turkish : English) ensures a strong Kazakh foundation while preserving competitive English performance.
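As a quick arithmetic check, the per-language budgets above reproduce both the 45.3B total and the 3:1:3 mixing ratio:

# Token budgets in billions, as reported above
kk, en, ru_tr = 19.45, 19.45, 6.4
print(kk + en + ru_tr)         # 45.3, the total pre-training budget
print(kk / ru_tr, en / ru_tr)  # ~3.0 each, i.e. roughly 3 : 1 : 3 (kk : ru+tr : en)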
Instruction Tuning
To enable robust instruction following and safe dialog generation, Sherkala is fine-tuned on a large-scale multilingual instruction dataset comprising:
- ~5.9M prompt–response pairs in Kazakh
- ~2.7M prompt–response pairs in English
- 263K prompt–response pairs in Russian
A dedicated safety dataset, created using a mix of direct and adversarial prompts, is incorporated to mitigate harmful or biased outputs and to ensure cultural alignment. More information can be found in the Sherkala paper.
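Pairs of this form are typically serialized with the chat template used at inference time before fine-tuning. Below is a minimal sketch of that formatting step; the prompt-response pair is invented for illustration and is not drawn from the training set.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("inceptionai/Llama-3.1-Sherkala-8B-Chat")

# Hypothetical pair (English gloss: "What is the capital of Kazakhstan?" /
# "The capital of Kazakhstan is Astana.")
conversation = [
    {"role": "user", "content": "Қазақстанның астанасы қандай?"},
    {"role": "assistant", "content": "Қазақстанның астанасы - Астана қаласы."},
]

# Render the pair into a single training string; during supervised fine-tuning
# the loss is typically masked on the prompt tokens so only the response is learned.
text = tokenizer.apply_chat_template(conversation, tokenize=False)
print(text)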
Evaluation
Sherkala has been extensively evaluated across downstream tasks, open-ended generation, and safety metrics. The following sections detail the evaluation results.
Downstream Evaluation
Evaluation Datasets
Sherkala is benchmarked on multiple tasks in Kazakh, Russian, and English, including:
- Knowledge: KazMMLU, MMLU, Belebele, etc.
- Commonsense Reasoning: HellaSwag (HS), PIQA, BoolQA, SIQA, ARC-Challenge (ARC), OpenBookQA (OBQA), NIS MATH, COPA.
- Misinformation & Bias: TruthfulQA (T-QA) and CrowS-Pairs.
Kazakh Benchmark Results
Model | AVG | KazMMLU | MMLU | Belebele | HS | PIQA | BoolQA | SIQA | ARC | OBQA | NIS | COPA | T-QA | CS-Pairs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BLOOM (7.1B) | 37.6 | 29.3 | 27.9 | 29.9 | 52.0 | 62.1 | 36.7 | 23.6 | 33.6 | 26.4 | 22.0 | 47.2 | 49.2 | 49.1 |
BLOOMZ (7.1B) | 36.9 | 29.2 | 27.8 | 30.4 | 50.8 | 54.4 | 36.8 | 24.4 | 31.0 | 22.1 | 23.0 | 51.8 | 48.1 | 50.1 |
Gemma-2 (9B) | 35.7 | 26.1 | 27.5 | 28.3 | 51.9 | 62.0 | 33.5 | 23.6 | 28.4 | 26.0 | 17.0 | 45.2 | 47.1 | 47.5 |
Gemma-2-it (9B) | 36.9 | 31.4 | 28.4 | 27.9 | 51.0 | 63.5 | 36.0 | 24.0 | 30.6 | 23.8 | 22.0 | 48.8 | 49.3 | 42.6 |
Qwen-2.5 (7B) | 38.5 | 35.1 | 31.3 | 31.2 | 53.4 | 54.8 | 38.0 | 27.1 | 30.2 | 26.3 | 36.0 | 46.0 | 48.0 | 42.6 |
Qwen-2.5-Instruct (7B) | 40.8 | 37.8 | 33.2 | 31.5 | 52.3 | 60.9 | 38.1 | 27.8 | 31.6 | 31.1 | 38.0 | 47.2 | 51.0 | 49.3 |
Llama-3.1 (8B) | 39.8 | 38.3 | 31.3 | 37.8 | 57.2 | 63.7 | 38.1 | 29.6 | 32.8 | 25.9 | 20.0 | 47.8 | 51.3 | 43.9 |
Llama-3.1-Instruct (8B) | 40.4 | 38.9 | 32.4 | 37.5 | 57.5 | 67.5 | 37.9 | 30.3 | 32.6 | 27.0 | 22.0 | 48.2 | 49.7 | 43.2 |
Llama-3.1-KazLLM-1.0 (8B) | 43.7 | 37.0 | 31.5 | 46.0 | 62.8 | 69.8 | 44.7 | 35.5 | 34.2 | 27.8 | 32.0 | 50.4 | 50.9 | 45.0 |
Sherkala (Ours) | 45.7 | 51.6 | 37.7 | 53.1 | 68.1 | 66.9 | 42.2 | 38.1 | 37.0 | 25.9 | 18.0 | 51.0 | 50.3 | 54.3 |
Sherkala-chat (Ours-chat) | 46.9 | 38.8 | 33.9 | 54.5 | 65.3 | 75.7 | 48.0 | 43.6 | 35.6 | 29.0 | 27.0 | 53.0 | 55.7 | 50.2 |
English Benchmark Results
Model | AVG | MMLU | RACE | HS | PIQA | BoolQA | SIQA | ARC | OBQA | Winogrande | TruthfulQA | CrowS-Pairs |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BLOOM (7.1B) | 48.5 | 29.1 | 36.5 | 59.6 | 73.6 | 62.2 | 46.5 | 33.4 | 35.8 | 38.9 | 68.9 | 72.6 |
BLOOMZ (7.1B) | 57.0 | 36.7 | 45.6 | 63.1 | 77.4 | 90.7 | 59.7 | 43.6 | 42.0 | 45.2 | 65.6 | — |
Gemma-2 (9B) | 39.4 | 27.4 | 27.8 | 33.2 | 59.1 | 62.2 | 37.6 | 24.2 | 26.4 | 46.4 | 49.3 | — |
Gemma-2-it (9B) | 53.2 | 37.7 | 46.7 | 65.4 | 69.5 | 80.1 | 44.1 | 40.7 | 29.6 | 62.1 | 56.5 | — |
Qwen-2.5 (7B) | 60.8 | 44.0 | 41.4 | 78.9 | 79.9 | 84.5 | 51.9 | 51.4 | 47.2 | 56.4 | 71.9 | — |
Qwen-2.5-Instruct (7B) | 62.1 | 46.7 | 46.3 | 80.5 | 80.3 | 86.4 | 48.7 | 54.9 | 48.8 | 64.8 | 63.2 | — |
Llama-3.1 (8B) | 56.6 | 39.6 | 38.9 | 79.0 | 81.3 | 65.3 | 52.6 | 53.5 | 45.0 | 45.2 | 65.5 | — |
Llama-3.1-Instruct (8B) | 60.1 | 41.7 | 44.9 | 79.2 | 81.0 | 79.4 | 52.7 | 55.0 | 43.6 | 54.0 | 69.0 | — |
Llama-3.1-KazLLM-1.0 (8B) | 58.6 | 39.7 | 44.3 | 77.9 | 80.8 | 72.8 | 51.5 | 54.6 | 43.0 | 51.0 | 70.0 | — |
Sherkala (Ours) | 58.7 | 46.8 | 39.2 | 78.3 | 80.5 | 77.2 | 51.3 | 52.1 | 46.0 | 49.6 | 65.9 | — |
Sherkala-chat (Ours-chat) | 58.6 | 39.0 | 41.5 | 76.2 | 79.0 | 82.7 | 56.8 | 51.1 | 41.6 | 56.4 | 62.0 | — |
Evaluation results on Kazakh and English language benchmarks. Average represents the mean score across tasks. Higher scores are better across all metrics. “HS”, “ARC”, “OBQA”, “T-QA” denote HellaSwag, ARC-Challenge (Easy), OpenBookQA, and TruthfulQA. Further details on the evaluation, including additional results in Russian, can be found in the Sherkala paper.
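The downstream benchmarks above are multiple-choice tasks, which are typically scored by selecting the candidate answer to which the model assigns the highest log-likelihood. Below is a minimal sketch of that scoring scheme, under the assumption that the harness works this way; the paper's exact evaluation setup may differ.

import torch

@torch.no_grad()
def choice_logprob(model, tokenizer, prompt: str, choice: str) -> float:
    # Sum of log-probabilities the model assigns to the tokens of `choice` given
    # `prompt`. Assumes tokenizing `prompt` yields a prefix of tokenizing
    # `prompt + choice`, which holds for typical BPE tokenizers when `choice`
    # begins with a space.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    logprobs = model(full_ids).logits[0, :-1].log_softmax(-1)
    choice_ids = full_ids[0, prompt_len:]
    positions = torch.arange(prompt_len - 1, full_ids.shape[1] - 1, device=full_ids.device)
    return logprobs[positions, choice_ids].sum().item()

def predict(model, tokenizer, question: str, choices: list[str]) -> str:
    scores = [choice_logprob(model, tokenizer, question, " " + c) for c in choices]
    return max(zip(scores, choices))[1]

Accuracy is then the fraction of questions for which the highest-likelihood choice is the gold answer.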
Generation Evaluation
We further evaluated open-ended text generation using GPT-4 as a judge. The following table shows average generation scores (with standard deviations) for models on the MT and Vicuna benchmarks across Kazakh, Russian, and English:
Model | Kazakh MT (avg ± sd) | Kazakh Vicuna (avg ± sd) | Russian MT (avg ± sd) | Russian Vicuna (avg ± sd) | English MT (avg ± sd) | English Vicuna (avg ± sd) |
---|---|---|---|---|---|---|
GPT-4o | 8.81 ± 1.51 | 9.32 ± 0.61 | 8.89 ± 1.59 | 9.79 ± 0.41 | 8.36 ± 1.35 | 9.03 ± 0.59 |
Qwen-2.5-7B-Instruct | 3.52 ± 3.52 | 3.23 ± 1.73 | 5.81 ± 2.36 | 6.05 ± 3.07 | 7.40 ± 1.85 | 8.06 ± 1.22 |
Llama-3.1-8B-Instruct | 3.76 ± 2.11 | 3.75 ± 1.91 | 0.85 ± 1.20 | 0.82 ± 1.55 | 6.55 ± 2.03 | 7.41 ± 1.28 |
KazLLM-1.0-8B | 3.98 ± 2.15 | 4.88 ± 2.01 | 0.72 ± 1.06 | 0.28 ± 0.71 | 6.00 ± 2.15 | 6.66 ± 1.24 |
Sherkala-chat | 5.99 ± 2.73 | 7.39 ± 1.89 | 1.02 ± 1.41 | 0.97 ± 1.70 | 5.78 ± 2.43 | 6.55 ± 1.59 |
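Below is a minimal sketch of such a judging loop, using the openai client library; the rubric prompt is invented for illustration and is not the judge prompt used in the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; the actual judge prompt used for these scores differs.
JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the question below on a 0-10 scale for "
    "helpfulness, relevance, accuracy, and fluency. Reply with a single number.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def judge(question: str, answer: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(question=question, answer=answer)}],
    )
    return float(response.choices[0].message.content.strip())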
Intended Use
Sherkala is intended for both research and commercial applications in Kazakh NLP, including:
- Chat Assistants: Conversational agents tailored for Kazakh speakers.
- Question Answering & Content Generation: Systems that deliver culturally aligned, factual, and contextually rich responses.
- Multilingual NLP: Applications that support English, Russian, and Turkish alongside Kazakh.
We believe that a number of audiences will benefit from our model:
- Academics: Those researching Kazakh natural language processing.
- Businesses: Companies targeting Kazakh-speaking audiences.
- Developers: Those integrating Kazakh language capabilities in apps.
Out-of-Scope Use
While Sherkala is a powerful language model catering to Kazakh and English, it is essential to understand its limitations and the potential for misuse.
Sherkala is not recommended for:
- Malicious Use: The model should not be used to generate harmful, misleading, or inappropriate content. This includes, but is not limited to:
  - Generating or promoting hate speech, violence, or discrimination
  - Spreading misinformation or fake news
  - Engaging in illegal activities or promoting them
- Handling Sensitive Information: The model should not be used to handle or generate personal, confidential, or otherwise sensitive information.
- Generalization Across All Languages: Sherkala is optimized only for Kazakh and English. It should not be assumed to have equal proficiency in other languages or dialects.
- High-Stakes Decisions: The model should not be used for making high-stakes decisions without human oversight. This includes medical, legal, financial, or safety-critical decisions, among others.
Bias, Risks, and Limitations
Although extensive measures have been taken to mitigate biases and ensure safe outputs, Sherkala—like all large language models—may still produce inaccurate, misleading, or biased content. Users should apply additional safety measures and conduct thorough evaluations when deploying the model in sensitive or high-stakes environments.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
Training Details
Training Data
Sherkala was trained on 45.3 billion tokens consisting of:
- 19.45B Kazakh tokens
- 19.45B English tokens
- 6.4B Russian and Turkish tokens
The dataset includes:
- Wikipedia, CommonCrawl, news articles, Hugging Face datasets
- Publicly available documents, educational content, and code
- Machine-translated Kazakh data from books and Wikipedia
- Data extracted via Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR)
Training Procedure
Sherkala is adapted from Llama-3.1-8B, trained in a continual pretraining setup.
Preprocessing
- Language-specific standardization, filtering, cleaning, and deduplication.
- Applied fuzzy deduplication using locality-sensitive hashing (LSH), reducing the dataset to 41.1% of its raw volume (a sketch follows this list).
- Kazakh tokenizer extended by 25%, reducing tokenization inefficiencies and improving Kazakh language processing.
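A minimal sketch of MinHash-LSH fuzzy deduplication using the datasketch library; the shingle size, permutation count, and similarity threshold here are illustrative assumptions, not the settings used for Sherkala.

from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of MinHash permutations (illustrative)

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    # Character 5-gram shingles; the shingle size is an illustrative choice.
    for i in range(max(1, len(text) - 4)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if not lsh.query(m):       # keep only if no near-duplicate is already indexed
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept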
Training Hyperparameters
- Learning rate: 1.5e-4
- Batch size: 4 million tokens
- Optimizer: AdamW (β1 = 0.9, β2 = 0.95, ε = 1e-5)
- Weight decay: 0.1
- Gradient norm clipping: 1.0
- Learning rate schedule:
  - Linear warm-up over 110 steps
  - 10× cosine decay (to one-tenth of the peak learning rate) until step 11,433; see the sketch below
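Reading the 10× cosine decay as a decay from the peak learning rate to one-tenth of it, the schedule can be sketched as follows. The step counts are the ones listed above; note that 11,433 steps at ~4M tokens per step roughly matches the 45.3B-token budget.

import math

MAX_LR = 1.5e-4
MIN_LR = MAX_LR / 10   # the "10x" decay target, as read above
WARMUP_STEPS = 110
TOTAL_STEPS = 11_433

def learning_rate(step: int) -> float:
    # Linear warm-up, then cosine decay from MAX_LR down to MIN_LR.
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))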
Speeds, Sizes, Times
- Training infrastructure: Cerebras Condor Galaxy 2 (CG-2) AI supercomputer
- Training executed on 16 Cerebras CS-2 systems
- Context length: 8,192 tokens
- Parallelism: Pure data parallelism across multiple CS-2 systems.