Introduction
More information is available at DPO-VP.
Drawing on ideas from Iterative DPO, we propose a self-improvement process built on the Qwen2.5-Math-7B base model. In this process, we use sampling and filtering to construct preference datasets for self-improvement from a challenging 8K-problem MATH dataset (a sketch of this step is shown below).
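The following is a minimal sketch of the sampling-filtering idea, not the released training code: for each problem we sample several solutions, keep a correct one as the chosen response and an incorrect one as the rejected response, and the resulting pairs feed a standard DPO training step. The `generate_candidates` sampler and `is_correct` answer checker are illustrative placeholders.

```python
import random

def build_preference_pairs(problems, generate_candidates, is_correct, n_samples=8):
    """Sampling-filtering sketch: turn sampled solutions into DPO preference pairs.

    `generate_candidates(problem, n)` and `is_correct(problem, solution)` are
    illustrative placeholders, not functions from the released code.
    """
    pairs = []
    for problem in problems:
        candidates = generate_candidates(problem, n_samples)
        correct = [c for c in candidates if is_correct(problem, c)]
        incorrect = [c for c in candidates if not is_correct(problem, c)]
        # Keep a problem only if it yields both a correct and an incorrect solution,
        # so every pair carries a clear chosen/rejected signal.
        if correct and incorrect:
            pairs.append({
                "prompt": problem,
                "chosen": random.choice(correct),
                "rejected": random.choice(incorrect),
            })
    return pairs
```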
The final model achieves an average score of 48.2 across five mathematical reasoning benchmarks, comparable to Qwen2.5-Math-7B-Instruct and to other RL-based methods trained under similar data conditions.
All results are reported as pass@1 accuracy.
Model | MATH500 | Minerva Math | OlympiadBench | AMC23 | AIME24 | Avg. |
---|---|---|---|---|---|---|
Qwen2.5-Math-7B * | 64.8 | 15.4 | 25.6 | 37.5 | 16.7 | 32.0 |
Qwen2.5-Math-7B-Instruct * | 83.2 | 33.5 | 38.4 | 62.5 | 20.0 | 47.5 |
rStar-Math-7B ^ | 78.4 | - | 47.1 | 47.5 | 26.7 | - |
Eurus-2-7B-PRIME * | 74.0 | 39.7 | 35.6 | 57.5 | 23.3 | 46.0 |
Qwen2.5-7B-Simple-RL-Zero ^ | 77.2 | 33.5 | 37.6 | 62.5 | 33.3 | 48.8 |
Qwen2.5-7B-Simple-RL-Zero * | 75.6 | 34.2 | 39.0 | 52.5 | 26.7 | 45.6 |
Qwen2.5-7B-PURE-VR * | 79.8 | 36.8 | 41.9 | 60.0 | 20.0 | 47.7 |
**Qwen2.5-7B-DPO-VP** | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 |
All models in the table are fine-tuned from the Qwen2.5-Math-7B base model. Bolded models were trained with the self-improvement method described above, using exactly the same prompts. Results marked with * come from our own evaluation, and results marked with ^ are taken from the corresponding model's technical report. Note that Qwen2.5-7B-Simple-RL-Zero has not released its trained model, so we evaluated a reproduced version found on Hugging Face. We also observed that, because Qwen's official evaluation code slices the model across the available GPUs, results can differ slightly when evaluating with different numbers of GPUs; our model and the reproduced results were both evaluated on 4 A800 GPUs.
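For reference, pass@1 here means each problem is answered once and that single answer is scored for correctness. A minimal sketch of the metric, where `answers_match` is an illustrative equivalence checker (e.g. comparing the boxed final answers) rather than the evaluation code actually used:

```python
def pass_at_1(predictions, references, answers_match):
    """Fraction of problems whose single generated answer is judged correct.

    `predictions` and `references` are aligned lists of final answers;
    `answers_match(pred, ref)` is an illustrative placeholder checker.
    """
    assert len(predictions) == len(references)
    correct = sum(answers_match(p, r) for p, r in zip(predictions, references))
    return correct / len(predictions)
```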
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SunnyLin/Qwen2.5-7B-DPO-VP"
device = "cuda"  # the device to run generation on

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."
messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt},
]

# Build the chat-formatted prompt and move the tokenized inputs to the GPU.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048,
)
# Strip the prompt tokens so only the newly generated answer is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
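Following the system prompt, the model ends its reply with a `\boxed{...}` final answer. A small illustrative helper (not part of the official code) for pulling that answer out of the response:

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in the response, if any.

    This simple regex does not handle nested braces inside the box;
    it is only an illustrative post-processing step.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

# `response` comes from the quick-start snippet above; the expected answer is -1.
print(extract_boxed_answer(response))
```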