Klear

🤗 Hugging Face | 💻 GitHub Repository | 📑 Technical Report | 💬 Issues & Discussions
🔥 News
- 2025.09.05: We've released the Klear-46B-A2.5B series, which currently includes a base model and an instruction-tuned model with DPO. A reasoning-enhanced variant is also in training; stay tuned for upcoming updates!
1. Introduction
Klear-46B-A2.5B is a sparse Mixture-of-Experts (MoE) large language model developed by the Kwai-Klear Team at Kuaishou, designed to deliver both high performance and inference efficiency. It features 256 experts, with only 8 routed experts and 1 shared expert activated per layer during the forward pass, resulting in 46 billion total parameters but just 2.5 billion active, achieving dense-level performance at a fraction of the computational cost.
The model was trained on over 22 trillion tokens using a three-stage progressive curriculum:
1. Foundational Knowledge Learning (12T tokens): General-purpose datasets such as CommonCrawl were processed with stratified quality filters, following a curriculum learning strategy that progresses from lower to higher data quality.
2. Data Complexity Enhancement (8T tokens): The proportion of mathematical, coding, and STEM-related data was gradually increased to strengthen the model's reasoning and problem-solving capabilities.
3. Reasoning Enhancement and Long-Context Stage (2T tokens): Training focused on synthetic and reasoning-intensive data, combined with a fast learning-rate annealing strategy to maximize data efficiency and optimize final performance.
As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense models with several times more active parameters, while offering significantly better efficiency and cost-effectiveness for real-world deployment.
Model Summary
The base and instruction-tuned (+ DPO) models share the following architecture:
key | value |
---|---|
hidden_size | 2048 |
moe_intermediate_size | 896 |
n_shared_experts | 1 |
num_attention_heads | 32 |
num_experts | 256 |
num_experts_per_tok | 8 |
num_hidden_layers | 32 |
num_key_value_heads | 4 |
vocab_size | 151936 |
tie_word_embeddings | false |
context length | 65536 |
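As a rough cross-check of the 46B-total / 2.5B-activated figures, the sketch below estimates parameter counts from the configuration above. It assumes SwiGLU experts, a head dimension of 128, no projection biases, and MoE in every layer; none of these details are stated in the table, so the result is only approximate, and conventions differ on whether embeddings count toward activated parameters.

```python
# Rough parameter estimate from the architecture table above.
# Assumptions not listed in the table: SwiGLU experts (gate/up/down projections),
# head_dim = 128, no projection biases, and every layer is an MoE layer.

hidden_size = 2048
moe_intermediate_size = 896
n_shared_experts = 1
num_attention_heads = 32
num_experts = 256
num_experts_per_tok = 8
num_hidden_layers = 32
num_key_value_heads = 4
vocab_size = 151936
head_dim = 128  # assumed, not given in the table

# Grouped-query attention: q/o projections use all heads, k/v use the KV heads.
attn = 2 * hidden_size * num_attention_heads * head_dim \
     + 2 * hidden_size * num_key_value_heads * head_dim

expert = 3 * hidden_size * moe_intermediate_size   # gate + up + down
router = hidden_size * num_experts

layer_total = attn + router + (num_experts + n_shared_experts) * expert
layer_active = attn + router + (num_experts_per_tok + n_shared_experts) * expert

embeddings = 2 * vocab_size * hidden_size  # untied input embedding + lm_head

total = num_hidden_layers * layer_total + embeddings
active = num_hidden_layers * layer_active  # excluding embeddings/lm_head

print(f"total  ~ {total / 1e9:.1f}B")                                   # ~46.5B with these assumptions
print(f"active ~ {active / 1e9:.1f}B (+ {embeddings / 1e9:.2f}B embeddings/lm_head)")
```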
Model Downloads
Model | #Total Params | #Activated Params | Context Length | Download Link |
---|---|---|---|---|
Klear-46B-A2.5B-Base | 46B | 2.5B | 64K | 🤗 Hugging Face |
Klear-46B-A2.5B-Instruct | 46B | 2.5B | 64K | 🤗 Hugging Face |
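If you prefer to fetch a checkpoint programmatically rather than through the links above, a minimal sketch with huggingface_hub is shown below; the repo ID is an assumption based on the model names in this card, so verify it against the actual Hugging Face pages.

```python
# Sketch: download a checkpoint locally with huggingface_hub.
# The repo ID below is an assumption based on the model names above;
# check it against the Hugging Face links in the table.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Kwai-Klear/Klear-46B-A2.5B-Instruct",  # or the Base variant
)
print(local_dir)  # use this path as model_path in the examples below
```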
2. Benchmark Evaluation
Klear-46B-A2.5B-Base Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Base | MiMo-7B-Base | Qwen3-8B-Base | Qwen3-14B-Base | Ling-lite-1.5-Base | Qwen3-30B-A3B-Base |
|---|---|---|---|---|---|---|---|
| | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
| | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
| Code | HumanEval† (0-shot) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
| | MBPP (3-shot) | 76 | 69.2* | 69 | 74 | 66.6 | 75.6 |
| Math | MATH (4-shot, CoT) | 55.7 | 38.8 | 60.8* | 62.02* | 59.9 | 59.04* |
| | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
| | GSM8K (4-shot, CoT) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
| General | MMLU-Pro (5-shot, CoT) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
| | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
| | CEval (5-shot) | 89.8 | 67.98 | 81.9 | 84.8 | 78.2 | 87.4 |
| | CMMLU (5-shot) | 88 | 70.79 | 82 | 85.6 | 81.2 | 87.1 |
| | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
| | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
| | BBH (3-shot, CoT) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
| | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
| | TriviaQA (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
| | NaturalQuestions (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
| | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
| | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
| | Average | 69.66 | - | 66.62 | 69.60 | 65.60 | 70.41 |
Note:
- Results marked with * are taken from the corresponding public reports; all other results were obtained with our internal evaluation framework.
- † During pretraining we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. We therefore adapted the prompt from the Ling-series paper to modify the original HumanEval; the table reports results after this modification.
Klear-46B-A2.5B-Instruct Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Instruct | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
|---|---|---|---|---|---|---|---|---|
| | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
| | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
| General | MMLU-Redux | 81.95 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
| | MMLU-Pro | 63.61 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
| | GPQA-Diamond | 49.12 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
| | SimpleQA | 6.2 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
| | CLUEWSC | 88.49 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
| | CEval | 85.98 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
| | C-SimpleQA | 42.8 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
| | LiveBench 1125 | 50 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
| Math | MATH500 | 86.4 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
| | AIME24 | 28.33 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
| | AIME25 | 19.17 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
| Code | HumanEval | 86.59 | 82.3* | 78.05 | 83.54 | 82.32 | 85.37 | 81.71 |
| | HumanEval+ | 79.27 | - | 73.17 | 76.83 | 75.61 | 83.54 | 76.83 |
| | MBPPEvalplus | 79.9 | 62.4 | 83.3 | 76.2 | 85.7 | 77.5 | 89.4 |
| | MBPPEvalplus++ | 68.8 | 50.4 | 71.7 | 66.1 | 74.1 | 66.7 | 75.1 |
| | LiveCodeBench v5 (2408-2501) | 27.96 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
| Alignment | IF-Eval | 81.89 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
| | Multi-IF (en+zh) | 78.46 | 61.83 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
| | MTBench | 8.42 | 7.86 | 6.875 | 8.21 | 8.68 | 8.62 | 9.33 |
| | MT-Eval | 8.13 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
| | AlignBench v1.1 | 7 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
| | Average | 53.74 | - | 46.54 | 52.61 | 50.54 | 48.95 | - |
Note:
- For InternLM3-8B-Instruct, results marked with * are taken from its official website; all other results were obtained with our internal evaluation framework.
- For Multi-IF, we report the overall average across all three rounds, pooling the Chinese and English metrics.
3. Quick start
Inference with Hugging Face Transformers
Inference is supported in Transformers starting from version 4.56.0; set trust_remote_code=True when loading the model.
Klear-46B-A2.5B-Base
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

text = "世界上最大的湖是"  # "The largest lake in the world is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
Klear-46B-A2.5B-Instruct
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Please write me a calculator in Python."
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=1024)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```
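If you want tokens printed as they are generated rather than all at once, the optional sketch below reuses the model, tokenizer, and input_tensor from the example above with Transformers' TextStreamer; it is a convenience suggestion, not part of the official usage instructions.

```python
# Optional: stream tokens to stdout as they are generated
# (reuses `model`, `tokenizer`, and `input_tensor` from the example above).
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    input_tensor.to(model.device),
    max_new_tokens=1024,
    streamer=streamer,
)
```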
Inference with vLLM
vLLM is a fast and memory-efficient inference framework. We provide our own forked version of vLLM here.
```bash
git clone https://github.com/Kwai-Klear/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .

vllm serve /path/to/Klear-Instruct --port 8000 --tensor-parallel-size 8 --trust-remote-code
```
An OpenAI-compatible API will be available at http://localhost:8000/v1.
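To verify the server, you can query it with the standard OpenAI Python client; the model name below is simply the path passed to vllm serve, and the API key is a placeholder since the local endpoint does not require one.

```python
# Query the locally served model through the OpenAI-compatible API.
# `model` must match the name the server registered (here, the serve path);
# the API key is a dummy value for the local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="/path/to/Klear-Instruct",
    messages=[{"role": "user", "content": "Please write me a calculator in Python."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```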
Alternatively, you can use the following Python script for offline inference:
```python
import torch
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=0.7,
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Please write me a calculator in Python."
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.6, top_p=0.95, top_k=40, max_tokens=1024
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```