Introduction

More information at DPO-VP.

Drawing on the ideas from Iterative DPO, we propose a self-improvement process based on the Qwen2.5-Math-7B base model. In this process, we perform sampling-filtering to construct preference datasets for self-improvement using a challenging 8K MATH dataset.
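
The sampling-filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper `build_preference_pairs`, the `is_correct` callback, the `max_pairs` cap, and the `prompt`/`chosen`/`rejected` record layout (a common convention for DPO data) are all assumptions. It takes several sampled responses for one prompt, splits them by whether the final answer is judged correct, drops prompts with no contrast, and pairs correct with incorrect responses.

```python
import itertools
import random


def build_preference_pairs(prompt, responses, is_correct, max_pairs=4, seed=0):
    """Pair correct (chosen) with incorrect (rejected) sampled responses.

    Hypothetical helper: `is_correct` is any callable that judges one response.
    """
    correct = [r for r in responses if is_correct(r)]
    incorrect = [r for r in responses if not is_correct(r)]
    # Filter step: if every sample is right (or every sample is wrong),
    # there is no preference signal for this prompt, so nothing is kept.
    if not correct or not incorrect:
        return []
    pairs = list(itertools.product(correct, incorrect))
    random.Random(seed).shuffle(pairs)  # deterministic subsample
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in pairs[:max_pairs]
    ]


# Toy usage with hand-written "samples" standing in for model outputs.
samples = ["x = -1, so \\boxed{-1}", "x = 1, so \\boxed{1}", "\\boxed{-1}"]
pairs = build_preference_pairs(
    "Solve 4x+5 = 6x+7.", samples, lambda r: r.endswith("\\boxed{-1}")
)
all_right = build_preference_pairs("q", ["\\boxed{-1}"], lambda r: True)
```

In an iterative setup, the model trained on one round's pairs would produce the samples for the next round; that loop is omitted here.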

The final model achieved an average score of 48.2 on five mathematical reasoning benchmarks, which is comparable to the performance of Qwen2.5-Math-7B-Instruct and other RL-based methods under similar data conditions.

All results are reported as pass@1 accuracy.

| Model (pass@1 acc) | MATH500 | Minerva Math | OlympiadBench | AMC23 | AIME24 | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B * | 64.8 | 15.4 | 25.6 | 37.5 | 16.7 | 32.0 |
| Qwen2.5-Math-7B-Instruct * | 83.2 | 33.5 | 38.4 | 62.5 | 20.0 | 47.5 |
| rStar-Math-7B ^ | 78.4 | - | 47.1 | 47.5 | 26.7 | - |
| Eurus-2-7B-PRIME * | 74.0 | 39.7 | 35.6 | 57.5 | 23.3 | 46.0 |
| Qwen2.5-7B-Simple-RL-Zero ^ | 77.2 | 33.5 | 37.6 | 62.5 | 33.3 | 48.8 |
| Qwen2.5-7B-Simple-RL-Zero * | 75.6 | 34.2 | 39.0 | 52.5 | 26.7 | 45.6 |
| Qwen2.5-7B-PURE-VR * | 79.8 | 36.8 | 41.9 | 60.0 | 20.0 | 47.7 |
| Qwen2.5-7B-DPO-VP | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 |

In the table, all models are fine-tuned from the Qwen2.5-Math-7B base model. Bolded models are those trained with the self-improvement method using exactly the same prompts. Results marked * are from our own evaluation, and results marked ^ are taken from the corresponding model's technical report. Note that Qwen2.5-7B-Simple-RL-Zero has not released its trained model, so we evaluated a reproduced version found on Hugging Face. We also observed that, because Qwen's official evaluation code slices the model, results can differ slightly depending on the number of GPUs used. Our model and the reproduced results were both evaluated on 4 A800 GPUs.
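
The Avg. column appears to be the unweighted mean of the five benchmark scores, rounded to one decimal place; the reported averages are consistent with that reading. As a quick check, the snippet below recomputes it for the Qwen2.5-7B-DPO-VP row (the dictionary name `dpo_vp_scores` is just illustrative).

```python
# Per-benchmark pass@1 scores for Qwen2.5-7B-DPO-VP, copied from the table.
dpo_vp_scores = {
    "MATH500": 74.8,
    "Minerva Math": 35.3,
    "OlympiadBench": 36.9,
    "AMC23": 67.5,
    "AIME24": 26.7,
}

# Unweighted mean over the five benchmarks, rounded to one decimal.
avg = round(sum(dpo_vp_scores.values()) / len(dpo_vp_scores), 1)
print(avg)
```

Because the mean is unweighted, small benchmarks such as AMC23 and AIME24 move the average as much as MATH500 does.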

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SunnyLin/Qwen2.5-7B-DPO-VP"
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."

messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# With device_map="auto", send the inputs to wherever the model was actually placed.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
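
Since the system prompt asks the model to put its final answer within \boxed{}, the answer can be pulled out of the decoded text with a small helper. The sketch below is an assumption for illustration: `extract_boxed` is a hypothetical function, not part of transformers or of the official evaluation code. It returns the contents of the last \boxed{...} and tracks brace depth so that nested LaTeX such as \frac{1}{2} survives intact.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer present
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never closed, e.g. generation was truncated


# Toy strings standing in for the model's decoded output.
simple = extract_boxed("Subtracting gives $x = \\boxed{-1}$.")
nested = extract_boxed("The answer is \\boxed{\\frac{1}{2}}")
missing = extract_boxed("No final answer here.")
```

Applied to the Quick Start output, this would be called as `extract_boxed(response)`.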