---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
library_name: transformers
---

# Klear

🤗 Hugging Face | 💻 GitHub Repository | 📑 Technical Report | 💬 Issues & Discussions

## 🔥 News

- 2025.09.05: We have released the `Klear-46B-A2.5B` series, which currently includes a base model and an instruction-tuned model with DPO. A reasoning-enhanced variant is also in training; stay tuned for upcoming updates!

## 1. Introduction

`Klear-46B-A2.5B` is a sparse Mixture-of-Experts (MoE) large language model developed by **the Kwai-Klear Team at Kuaishou**, designed to deliver both **high performance** and **inference efficiency**. Each MoE layer contains **256 experts**, with only **8 experts and 1 shared expert activated** per token during the forward pass, resulting in **46 billion total parameters** but just **2.5 billion active**, achieving dense-level performance at a fraction of the computational cost.

The model was trained on over **22 trillion tokens** using a **three-stage progressive curriculum**:

1. **Foundational Knowledge Learning (12T tokens):** General-purpose datasets such as CommonCrawl were processed with stratified quality filters, following a curriculum learning strategy that progresses from lower to higher data quality.
2. **Data Complexity Enhancement (8T tokens):** The proportion of mathematical, coding, and STEM-related data was gradually increased to strengthen the model's reasoning and problem-solving capabilities.
3. **Reasoning Enhancement and Long-Context Stage (2T tokens):** Training focused on synthetic and reasoning-intensive data, combined with fast learning-rate annealing to maximize data efficiency and optimize final performance.

As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense models with several times more active parameters, while offering significantly better efficiency and cost-effectiveness for real-world deployment.

## Model Summary

The base and instruction-tuned + DPO models share the following architecture:

| **key** | **value** |
|---------------------------|-----------|
| hidden_size | 2048 |
| moe_intermediate_size | 896 |
| n_shared_experts | 1 |
| num_attention_heads | 32 |
| num_experts | 256 |
| num_experts_per_tok | 8 |
| num_hidden_layers | 32 |
| num_key_value_heads | 4 |
| vocab_size | 151936 |
| tie_word_embeddings | false |
| context length | 65536 |
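To make the sparse activation pattern concrete, below is a minimal, illustrative sketch of top-k expert routing with one shared expert, in toy form, mirroring the configuration above (256 routed experts, 8 active per token, 1 shared expert). It is not the Klear implementation, just the general MoE pattern the config implies.

```python
# Illustrative sketch only: top-k MoE routing with a shared expert.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySparseMoELayer(nn.Module):
    def __init__(self, hidden_size=2048, moe_intermediate_size=896,
                 num_experts=256, num_experts_per_tok=8):
        super().__init__()
        self.top_k = num_experts_per_tok
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

        def ffn():
            return nn.Sequential(
                nn.Linear(hidden_size, moe_intermediate_size),
                nn.SiLU(),
                nn.Linear(moe_intermediate_size, hidden_size),
            )

        self.experts = nn.ModuleList(ffn() for _ in range(num_experts))
        self.shared_expert = ffn()  # always active for every token

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = self.router(x)                      # (num_tokens, num_experts)
        top_scores, top_idx = torch.topk(scores, self.top_k, dim=-1)
        top_weights = F.softmax(top_scores, dim=-1)  # normalize over chosen experts
        out = self.shared_expert(x)                  # shared-expert contribution
        for t in range(x.size(0)):                   # per-token loop for clarity, not speed
            for k in range(self.top_k):
                expert = self.experts[int(top_idx[t, k])]
                out[t] = out[t] + top_weights[t, k] * expert(x[t])
        return out


# Shrunk dimensions so the toy runs instantly on CPU.
layer = ToySparseMoELayer(hidden_size=64, moe_intermediate_size=28,
                          num_experts=16, num_experts_per_tok=8)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

Only the selected experts are evaluated per token, which is why the active parameter count (2.5B) stays far below the total parameter count (46B).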
### Model Downloads

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Klear-46B-A2.5B-Base | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Base) |
| Klear-46B-A2.5B-Instruct | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct) |
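If you prefer to fetch a checkpoint ahead of time rather than letting `from_pretrained` download it lazily, a minimal sketch with `huggingface_hub` (the `local_dir` path is just an example):

```python
# Optional: pre-download a checkpoint to a local directory.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Kwai-Klear/Klear-46B-A2.5B-Instruct",  # or "Kwai-Klear/Klear-46B-A2.5B-Base"
    local_dir="./Klear-46B-A2.5B-Instruct",
)
print(local_path)
```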
## 2. Benchmark Evaluation

### Klear-46B-A2.5B-Base Evaluation Results

| Ability | Benchmark | Klear-46B-A2.5B-Base | MiMO-7B-Base | Qwen3-8B-BASE | Qwen3-14B-BASE | Ling-lite-1.5-Base | Qwen3-30B-A3B-BASE |
| ----------- | ---------------------- | -------------------- | ------------ | ------------- | -------------- | ------------------ | ------------------ |
| | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
| | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
| **Code** | HumanEval† (0-shot) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
| | MBPP (3-shot) | 76 | 69.2* | 69 | 74 | 66.6 | 75.6 |
| **Math** | MATH (4-shot, cot) | 55.7 | 38.8 | 60.8* | 62.02* | 59.9 | 59.04* |
| | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
| | GSM8K (4-shot, cot) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
| **General** | MMLU-Pro (5-shot, cot) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
| | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
| | CEval (5-shot) | 89.8 | 67.98 | 81.9 | 84.8 | 78.2 | 87.4 |
| | CMMLU (5-shot) | 88 | 70.79 | 82 | 85.6 | 81.2 | 87.1 |
| | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
| | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
| | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
| | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
| | TriviaQA (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
| | NaturalQs (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
| | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
| | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
| | Average | 69.66 | - | 66.62 | 69.60 | 65.60 | 70.41 |

Note:
1. Results marked with `*` are taken from the corresponding public reports; all other evaluations were conducted with our internal evaluation framework.
2. `†` During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. We therefore modified the original HumanEval setup using the prompt from the Ling-series paper; the table reports results after this modification.
### Klear-46B-A2.5B-Instruct Evaluation Results

| Ability | Benchmark | Klear-46B-A2.5B-Instruct | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
| ------------- | --------------------------- | ------------------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
| | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
| | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
| **General** | MMLU-Redux | 81.95 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
| | MMLU-Pro | 63.61 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
| | GPQA-Diamond | 49.12 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
| | SimpleQA | 6.2 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
| | CLUEWSC | 88.49 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
| | CEval | 85.98 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
| | C-SimpleQA | 42.8 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
| | LiveBench 1125 | 50 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
| **Math** | MATH500 | 86.4 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
| | AIME24 | 28.33 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
| | AIME25 | 19.17 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
| **Code** | HumanEval | 86.59 | 82.3* | 78.05 | 83.54 | 82.32 | 85.37 | 81.71 |
| | HumanEval+ | 79.27 | - | 73.17 | 76.83 | 75.61 | 83.54 | 76.83 |
| | MBPPEvalplus | 79.9 | 62.4 | 83.3 | 76.2 | 85.7 | 77.5 | 89.4 |
| | MBPPEvalplus++ | 68.8 | 50.4 | 71.7 | 66.1 | 74.1 | 66.7 | 75.1 |
| | LiveCodeBench v5 (2408-2501) | 27.96 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
| **Alignment** | IF-Eval | 81.89 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
| | Multi-IF (en+zh) | 78.46 | 61.83 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
| | MTBench | 8.42 | 7.86 | 6.875 | 8.21 | 8.68 | 8.62 | 9.33 |
| | MT-Eval | 8.13 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
| | AlignBench v1.1 | 7 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
| | Average | 53.74 | - | 46.54 | 52.61 | 50.54 | 48.95 | - |

Note:
1. For InternLM3-8B-Instruct, results marked with `*` are taken from its official website; all other evaluations were conducted with our internal evaluation framework.
2. For Multi-IF, we report the overall average across all three rounds, pooling the Chinese and English metrics.

## 3. Quick start

### Inference with Hugging Face Transformers

You can run inference with `transformers` starting from version `4.56.0`; set `trust_remote_code=True` when loading the tokenizer and model.
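A minimal setup sketch for upgrading an older environment (`accelerate` is assumed here only because the examples below use `device_map="auto"`):

```shell
pip install "transformers>=4.56.0" accelerate
```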
#### Klear-46B-A2.5B-Base

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)

text = "世界上最大的湖是"  # "The largest lake in the world is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

#### Klear-46B-A2.5B-Instruct

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Please write a calculator program in Python for me."
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=1024)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```

### Inference with vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-speed and memory-efficient inference framework. We provide **our own forked version of [vLLM](https://github.com/Kwai-Klear/vllm) here.**

```shell
git clone https://github.com/Kwai-Klear/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
vllm serve /path/to/Klear-Instruct --port 8000 --tensor-parallel-size 8 --trust-remote-code
```

An OpenAI-compatible API will be available at `http://localhost:8000/v1`. Alternatively, you can use the following Python script for offline inference:

```python
import torch
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=0.7,
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Please write a calculator program in Python for me."
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=40,
    max_tokens=1024,  # vLLM's SamplingParams uses max_tokens, not max_new_tokens
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
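Once the server started by `vllm serve` is up, the endpoint can be queried with any OpenAI-compatible client. A minimal sketch using the `openai` Python package (the `model` value must match the path or served model name passed to `vllm serve`; the API key is a placeholder since the local server does not require one by default):

```python
# Query the locally served OpenAI-compatible endpoint started by `vllm serve`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="/path/to/Klear-Instruct",  # must match the model name registered by vllm serve
    messages=[{"role": "user", "content": "Please write a calculator program in Python for me."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```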