Update README.md

3afd206 verified 21 days ago

5.72 kB

	---
	license: apache-2.0
	language:
	- en
	base_model:
	- Qwen/Qwen3-8B-Base
	datasets:
	- Kwai-Klear/KlearReasoner-MathSub-30K
	- Kwai-Klear/KlearReasoner-CodeSub-15K
	metrics:
	- accuracy
	---


	# ✨ Klear-Reasoner-8B-SFT
	We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. We investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens.

	\| Resource \| Link \|
	\|---\|---\|
	\| 📝 Preprints \| [Paper](https://arxiv.org/pdf/2508.07629) \|
	\| 🤗 Daily Paper \| [Paper](https://huggingface.co/papers/2508.07629) \|
	\| 🤗 Model Hub \| [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) \|
	\| 🤗 Dataset Hub \| [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) \|
	\| 🤗 Dataset Hub \| [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) \|
	\| 🐛 Issues & Discussions \| [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) \|
	\| 📧 Contact \| [email protected] \|

	## 📌 Overview

	<div align="center">
	<img src="main_result.png" width="100%"/>

	<sub>Benchmark accuracy of Klear-Reasoner-8B on AIME 2024/2025 (avg@64), LiveCodeBench V5 (2024/08/01-2025/02/01, avg@8), and v6 (2025/02/01-2025/05/01, avg@8).</sub>
	</div>

	Klear-Reasoner is an 8-billion-parameter reasoning model that achieves SOTA performance on challenging math and coding benchmarks:

	\| Benchmark \| AIME 2024 \| AIME 2025 \| LiveCodeBench V5 \| LiveCodeBench V6 \|
	\|---\|---\|---\|---\|---\|
	\| Score \| 90.5 % \| 83.2 % \| 66.0 % \| 58.1 % \|

	The model combines:
	1. Quality-centric long CoT SFT – distilled from DeepSeek-R1-0528.
	2. Gradient-Preserving Clipping Policy Optimization (GPPO) – a novel RL method that keeps gradients from clipped tokens to boost exploration & convergence.

	---

	### Evaluation
	When we expand the inference budget to 64K and adopt the YaRN method with a scaling factor of 2.5. Evaluation is coming soon, stay tuned.

	## 📊 Benchmark Results (Pass@1)

	\| Model \| AIME2024<br>avg@64 \| AIME2025<br>avg@64 \| HMMT2025<br>avg@64 \| LCB V5<br>avg@8 \| LCB V6<br>avg@8 \|
	\|-------\|--------------------\|--------------------\|--------------------\|-----------------\|-----------------\|
	\| AReal-boba-RL-7B \| 61.9 \| 48.3 \| 29.4 \| 34.3 \| 31.0† \|
	\| MiMo-7B-RL \| 68.2 \| 55.4 \| 35.7 \| 57.8 \| 49.3 \|
	\| Skywork-OR1-7B \| 70.2 \| 54.6 \| 35.7 \| 47.6 \| 42.7 \|
	\| AceReason-Nemotron-1.1-7B \| 72.6 \| 64.8 \| 42.9 \| 57.2 \| 52.1 \|
	\| POLARIS-4B-Preview \| 81.2 \| _79.4_ \| 58.7 \| 58.5† \| 53.0† \|
	\| Qwen3-8B \| 76.0 \| 67.3 \| 44.7† \| 57.5 \| 48.4† \|
	\| Deepseek-R1-0528-Distill-8B \| _86.0_ \| 76.3 \| 61.5 \| 61.0† \| 51.6† \|
	\| OpenReasoning-Nemotron-7B \| 84.7 \| 78.2 \| 63.5 \| _65.6_† \| _56.3_† \|
	\| Klear-Reasoner-8B-SFT \| 75.6 \| 70.1 \| 57.6 \| 58.5 \| 49.6 \|
	\| Klear-Reasoner-8B \| 83.2 \| 75.6 \| 60.3 \| 61.6 \| 53.1 \|
	\| w/ 64K Inference Budget \| 90.5 \| 83.2 \| 70.8 \| 66.0 \| 58.1 \|

	> We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).


	---

	## 🧪 Training
	### Configure the experimental environment
	```bash
	git clone https://github.com/suu990901/Klear_Reasoner
	cd Klear_Reasoner
	pip install -r requirements.txt
	```
	For the code, we use [Firejail](https://github.com/netblue30/firejail) for the sandbox environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.

	### Using Ray for Multi-Node Training
	For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
	#### Step 1: Start Ray on the Head Node (node0)

	On the first node (typically called `node0`), run:

	```bash
	ray start --head --dashboard-host=0.0.0.0
	```

	Get the IP address of the master node.
	```bash
	MASTER_IP=$(hostname -I \| awk '{print $1}')
	```
	#### Step 2: Connect Other Nodes (e.g., node1)

	On each additional worker node (e.g., `node1`), run the following, replacing the IP with that of your head node:

	```bash
	ray start --address=\"$MASTER_IP:6379\"
	```

	### RL Training
	Run the following script on the master node to start the training task.

	```bash
	bash recipe/dapo/perf_run_dapo_ours_math.sh # For Math RL
	bash recipe/dapo/perf_run_dapo_ours_code.sh # For Code RL
	```

	In the startup script, you need to set the following variables:
	```bash
	YOUR_MODEL_PATH="<your_model_path>"
	CKPTS_SAVE_DIR="<ckpts_save_path>"
	YOUR_TRAIN_FILE="<train_data_path>"
	YOUR_TEST_FILE="<test_data_path>"
	```

	## 🤝 Citation
	If you find this work helpful, please cite our paper:
	```bibtex
	@misc{su2025klearreasoneradvancingreasoningcapability,
	title={Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization},
	author={Zhenpeng Su and Leiyu Pan and Xue Bai and Dening Liu and Guanting Dong and Jiaming Huang and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
	year={2025},
	eprint={2508.07629},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2508.07629},
	}
	```