	---
	library_name: transformers
	pipeline_tag: text-generation
	license: cc-by-nc-4.0
	---

	This repository contains the Guru-32B (base Qwen2.5-32B) model presented in [Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective](https://huggingface.co/papers/2506.14965).

	The results below were produced with our evaluation [code](https://github.com/LLM360/Reasoning360/tree/main/scripts/offline_eval). All models were evaluated with the same sampling parameters: temperature=1.0 and top_p=0.7.
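Several benchmarks in the table are reported as avg@k. The following is a minimal sketch of how such a score could be computed, assuming avg@k denotes accuracy averaged over k sampled completions per problem and then over problems. The function name `avg_at_k` and the grading data are hypothetical; the linked evaluation code is the authoritative definition.

```python
from statistics import mean


def avg_at_k(correct_per_problem):
    """Return the mean accuracy (in percent) over k samples per problem,
    averaged across problems. Assumed definition, for illustration only."""
    return 100.0 * mean(mean(samples) for samples in correct_per_problem)


# Hypothetical grading results: 3 problems, k=4 samples each (1 = correct, 0 = incorrect)
results = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

score = avg_at_k(results)
print(f"avg@4 = {score:.2f}")
```

Averaging over several samples reduces the variance that sampling at temperature=1.0 introduces, which is presumably why the smaller benchmarks use it.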

	| Domain | Benchmark | GURU 7B | General Reasoner 7B | ORZ 7B◇ | SimpleRL 7B | GURU 32B | ORZ 32B◇ | SimpleRL 32B |
	|----------------|--------------------------------|------------:|-------------------------:|------------:|---------------:|-------------:|--------------:|-----------------:|
	| Math | AIME24 (avg@32) | 17.50 | 17.08 | 16.25 | 15.60 | 34.89 | 47.50 | 27.20 |
	| | MATH500 | 77.25 | 70.40 | 80.80 | 87.00 | 86.00 | 89.80 | 89.60 |
	| Code | LiveCodeBench (avg@4) | 16.49 | 8.51 | 5.47 | 6.72 | 29.30 | 22.04 | 19.80 |
	| | HumanEval (avg@4) | 82.62 | 61.12 | 67.38 | 58.08 | 90.85 | 84.30 | 81.25 |
	| | MBPP | 70.00 | 39.80 | 48.40 | 49.60 | 78.80 | 74.20 | 76.75 |
	| Science | GPQA-diamond (avg@4) | 40.78 | 38.64 | 37.63 | 35.98 | 50.63 | 55.67 | 46.46 |
	| | SuperGPQA | 31.80 | 30.64 | 29.75 | 27.29 | 43.60 | 46.05 | 37.73 |
	| Logic | ARC-AGI (avg@4) | 3.31 | 0.75 | 0.00 | 0.50 | 7.63 | 2.31 | 5.25 |
	| | Zebra Puzzle (avg@4) | 39.40 | 0.07 | 1.00 | 0.62 | 45.21 | 0.54 | 1.16 |
	| Simulation | CodeI/O (avg@4) | 15.63 | 7.13 | 5.13 | 6.63 | 12.63 | 3.75 | 9.75 |
	| | CruxEval-I | 61.72 | 63.63 | 69.38 | 56.25 | 80.63 | 71.13 | 72.63 |
	| | CruxEval-O | 71.28 | 56.50 | 65.88 | 58.31 | 88.75 | 82.38 | 67.75 |
	| Tabular | FinQA | 34.70 | 34.33 | 37.60 | 35.10 | 46.14 | 45.20 | 45.41 |
	| | HiTab | 74.20 | 54.40 | 54.10 | 50.40 | 82.00 | 63.30 | 69.00 |
	| | MultiHiertt (avg@4) | 44.94 | 31.62 | 38.10 | 37.57 | 55.28 | 52.83 | 52.83 |
	| Others | IFEval | 35.81 | 39.56 | 32.72 | 36.69 | 55.45 | 38.26 | 55.27 |
	| | LiveBench | 18.57 | 19.76 | 12.64 | 15.20 | 34.30 | 28.78 | 28.33 |
	| | Average Score | 43.29 | 33.76 | 35.42 | 33.97 | 54.24 | 47.53 | 46.25 |
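The Average Score row appears consistent with an unweighted mean over the 17 benchmark rows, with each benchmark counted once regardless of domain. The check below is a small sketch for the two GURU columns, with scores copied from the table. The helper name `unweighted_mean` is illustrative, and the aggregation method is inferred from the numbers rather than stated in this README.

```python
# Benchmark scores copied from the table, in row order (AIME24 through LiveBench)
guru_7b = [17.50, 77.25, 16.49, 82.62, 70.00, 40.78, 31.80, 3.31, 39.40,
           15.63, 61.72, 71.28, 34.70, 74.20, 44.94, 35.81, 18.57]
guru_32b = [34.89, 86.00, 29.30, 90.85, 78.80, 50.63, 43.60, 7.63, 45.21,
            12.63, 80.63, 88.75, 46.14, 82.00, 55.28, 55.45, 34.30]


def unweighted_mean(scores):
    """Average that weights every benchmark equally."""
    return sum(scores) / len(scores)


avg_7b = round(unweighted_mean(guru_7b), 2)
avg_32b = round(unweighted_mean(guru_32b), 2)
print(avg_7b, avg_32b)  # compare with the Average Score row: 43.29 and 54.24
```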



	Example usage:
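	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_id = "LLM360/Guru-32B"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

	messages = [{"role": "user", "content": "What is reinforcement learning?"}]
	# add_generation_prompt=True appends the assistant turn marker so the model answers the question
	inputs = tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
	).to(model.device)
	# do_sample=True is needed for temperature and top_p to take effect
	outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=1.0, top_p=0.7)
	# The decoded text contains the prompt followed by the generated answer
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```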

	Please refer to the [paper](https://arxiv.org/abs/2506.14965) for more details.
