Klear

🤗 Hugging Face | 💻 GitHub Repository | 📑 Technical Report | 💬 Issues & Discussions
🔥 News
- 2025.09.05: We've released the Klear-46B-A2.5B series, which currently includes a base model and an instruction-tuned model with DPO. A reasoning-enhanced variant is also in training; stay tuned for upcoming updates!
1. Introduction
Klear-46B-A2.5B is a sparse Mixture-of-Experts (MoE) large language model developed by the Kwai-Klear Team at Kuaishou, designed to deliver both high performance and inference efficiency. It features 256 experts, with only 8 routed experts and 1 shared expert activated per layer during the forward pass, resulting in 46 billion total parameters but just 2.5 billion active, achieving dense-level performance at a fraction of the computational cost.
The model was trained on over 22 trillion tokens using a three-stage progressive curriculum:
1. Foundational Knowledge Learning (12T tokens): General-purpose datasets such as CommonCrawl were processed with stratified quality filters, following a curriculum learning strategy that progresses from lower to higher data quality.
2. Data Complexity Enhancement (8T tokens): The proportion of mathematical, coding, and STEM-related data was gradually increased to strengthen the model's reasoning and problem-solving capabilities.
3. Reasoning Enhancement and Long-Context Stage (2T tokens): Training focused on synthetic and reasoning-intensive data, combined with a fast learning-rate annealing strategy to maximize data efficiency and optimize final performance.
As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense models with several times more active parameters, while offering significantly better efficiency and cost-effectiveness for real-world deployment.
Model Summary
The base and instruction-tuned (+ DPO) models share the following architecture:
key | value |
---|---|
hidden_size | 2048 |
moe_intermediate_size | 896 |
n_shared_experts | 1 |
num_attention_heads | 32 |
num_experts | 256 |
num_experts_per_tok | 8 |
num_hidden_layers | 32 |
num_key_value_heads | 4 |
vocab_size | 151936 |
tie_word_embeddings | false |
context length | 65536 |
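As a rough cross-check of the 46B-total / 2.5B-activated figures, the sketch below estimates parameter counts from the configuration above. It assumes SwiGLU experts, a head dimension of 128, no projection biases, and MoE in every layer; none of these details are stated in the table, so the result is only approximate, and conventions differ on whether embeddings count toward activated parameters.

```python
# Rough parameter estimate from the architecture table above.
# Assumptions not listed in the table: SwiGLU experts (gate/up/down projections),
# head_dim = 128, no projection biases, and every layer is an MoE layer.

hidden_size = 2048
moe_intermediate_size = 896
n_shared_experts = 1
num_attention_heads = 32
num_experts = 256
num_experts_per_tok = 8
num_hidden_layers = 32
num_key_value_heads = 4
vocab_size = 151936
head_dim = 128  # assumed, not given in the table

# Grouped-query attention: q/o projections use all heads, k/v use the KV heads.
attn = 2 * hidden_size * num_attention_heads * head_dim \
     + 2 * hidden_size * num_key_value_heads * head_dim

expert = 3 * hidden_size * moe_intermediate_size   # gate + up + down
router = hidden_size * num_experts

layer_total = attn + router + (num_experts + n_shared_experts) * expert
layer_active = attn + router + (num_experts_per_tok + n_shared_experts) * expert

embeddings = 2 * vocab_size * hidden_size  # untied input embedding + lm_head

total = num_hidden_layers * layer_total + embeddings
active = num_hidden_layers * layer_active  # excluding embeddings/lm_head

print(f"total  ~ {total / 1e9:.1f}B")                                   # ~46.5B with these assumptions
print(f"active ~ {active / 1e9:.1f}B (+ {embeddings / 1e9:.2f}B embeddings/lm_head)")
```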
Model Downloads
Model | #Total Params | #Activated Params | Context Length | Download Link |
---|---|---|---|---|
Klear-46B-A2.5B-Base | 46B | 2.5B | 64K | 🤗 Hugging Face |
Klear-46B-A2.5B-Instruct | 46B | 2.5B | 64K | 🤗 Hugging Face |
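If you prefer to fetch a checkpoint programmatically rather than through the links above, a minimal sketch with huggingface_hub is shown below; the repo ID is an assumption based on the model names in this card, so verify it against the actual Hugging Face pages.

```python
# Sketch: download a checkpoint locally with huggingface_hub.
# The repo ID below is an assumption based on the model names above;
# check it against the Hugging Face links in the table.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Kwai-Klear/Klear-46B-A2.5B-Instruct",  # or the Base variant
)
print(local_dir)  # use this path as model_path in the examples below
```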
2. Benchmark Evaluation
Klear-46B-A2.5B-Base Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Base | MiMo-7B-Base | Qwen3-8B-Base | Qwen3-14B-Base | Ling-lite-1.5-Base | Qwen3-30B-A3B-Base |
|---|---|---|---|---|---|---|---|
| | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
| | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
| Code | HumanEval† (0-shot) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
| | MBPP (3-shot) | 76 | 69.2* | 69 | 74 | 66.6 | 75.6 |
| Math | MATH (4-shot, CoT) | 55.7 | 38.8 | 60.8* | 62.02* | 59.9 | 59.04* |
| | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
| | GSM8K (4-shot, CoT) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
| General | MMLU-Pro (5-shot, CoT) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
| | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
| | CEval (5-shot) | 89.8 | 67.98 | 81.9 | 84.8 | 78.2 | 87.4 |
| | CMMLU (5-shot) | 88 | 70.79 | 82 | 85.6 | 81.2 | 87.1 |
| | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
| | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
| | BBH (3-shot, CoT) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
| | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
| | TriviaQA (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
| | NaturalQuestions (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
| | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
| | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
| | Average | 69.66 | - | 66.62 | 69.60 | 65.60 | 70.41 |
Note:
- Results marked with * are taken from the corresponding public reports; all other results were obtained with our internal evaluation framework.
- † During pretraining we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. We therefore adapted the prompt from the Ling-series paper to modify the original HumanEval; the table reports results after this modification.
Klear-46B-A2.5B-Instruct Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Instruct | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
|---|---|---|---|---|---|---|---|---|
| | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
| | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
| General | MMLU-Redux | 81.95 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
| | MMLU-Pro | 63.61 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
| | GPQA-Diamond | 49.12 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
| | SimpleQA | 6.2 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
| | CLUEWSC | 88.49 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
| | CEval | 85.98 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
| | C-SimpleQA | 42.8 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
| | LiveBench 1125 | 50 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
| Math | MATH500 | 86.4 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
| | AIME24 | 28.33 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
| | AIME25 | 19.17 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
| Code | HumanEval | 86.59 | 82.3* | 78.05 | 83.54 | 82.32 | 85.37 | 81.71 |
| | HumanEval+ | 79.27 | - | 73.17 | 76.83 | 75.61 | 83.54 | 76.83 |
| | MBPPEvalplus | 79.9 | 62.4 | 83.3 | 76.2 | 85.7 | 77.5 | 89.4 |
| | MBPPEvalplus++ | 68.8 | 50.4 | 71.7 | 66.1 | 74.1 | 66.7 | 75.1 |
| | LiveCodeBench v5 (2408-2501) | 27.96 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
| Alignment | IF-Eval | 81.89 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
| | Multi-IF (en+zh) | 78.46 | 61.83 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
| | MTBench | 8.42 | 7.86 | 6.875 | 8.21 | 8.68 | 8.62 | 9.33 |
| | MT-Eval | 8.13 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
| | AlignBench v1.1 | 7 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
| | Average | 53.74 | - | 46.54 | 52.61 | 50.54 | 48.95 | - |
Note:
- For InternLM3-8B-Instruct, results marked with * are taken from its official website; all other results were obtained with our internal evaluation framework.
- For Multi-IF, we report the overall average across all three rounds, pooling the Chinese and English metrics.
3. Quick start
Inference with Hugging Face Transformers
Inference is supported in Transformers starting from version 4.56.0; set trust_remote_code=True when loading the model.
Klear-46B-A2.5B-Base
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

text = "世界上最大的湖是"  # "The largest lake in the world is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
Klear-46B-A2.5B-Instruct
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Please write me a calculator in Python."
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=1024)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```
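If you want tokens printed as they are generated rather than all at once, the optional sketch below reuses the model, tokenizer, and input_tensor from the example above with Transformers' TextStreamer; it is a convenience suggestion, not part of the official usage instructions.

```python
# Optional: stream tokens to stdout as they are generated
# (reuses `model`, `tokenizer`, and `input_tensor` from the example above).
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    input_tensor.to(model.device),
    max_new_tokens=1024,
    streamer=streamer,
)
```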
Inference with vLLM
vLLM is a fast and memory-efficient inference framework. We provide our own forked version of vLLM here.
```bash
git clone https://github.com/Kwai-Klear/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .

vllm serve /path/to/Klear-Instruct --port 8000 --tensor-parallel-size 8 --trust-remote-code
```
An OpenAI-compatible API will be available at http://localhost:8000/v1.
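To verify the server, you can query it with the standard OpenAI Python client; the model name below is simply the path passed to vllm serve, and the API key is a placeholder since the local endpoint does not require one.

```python
# Query the locally served model through the OpenAI-compatible API.
# `model` must match the name the server registered (here, the serve path);
# the API key is a dummy value for the local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="/path/to/Klear-Instruct",
    messages=[{"role": "user", "content": "Please write me a calculator in Python."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```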
Alternatively, you can use the following Python script for offline inference:
```python
import torch
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=0.7,
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Please write me a calculator in Python."
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.6, top_p=0.95, top_k=40, max_tokens=1024
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```