DRIVE: Data Curation Best Practices for Reinforcement Learning wIth VErifiable Reward in Competitive Code Generation

Hunyuan Team, Tencent

📖 Paper · 📙 SFT Model · 📘 RL Model · 📜 Citation


Abstract

Recent reasoning-first models have spurred a resurgence of interest in RLVR (Reinforcement Learning with Verifiable Reward). However, most advances have concentrated on mathematics, while competitive-programming code generation remains relatively underexplored. This work investigates how to construct RLVR datasets and presents practical training techniques that yield strong performance.

Our pipeline begins with Supervised Fine-Tuning (SFT) distilled from strong open-source models. This is followed by a two-stage RL process using executable, testcase-driven rewards:

  1. Stage 1 (Entropy Expansion): Training on a large, uniformly distributed set of problems with moderate rollouts (8) and a shorter context (24k) to expand entropy and mitigate repetition.
  2. Stage 2 (Hard-Focus Curriculum): Updating on a small, high-quality set of challenging problems using Pre-GRPO with a large rollout budget (64) under a hard-focus curriculum.

We implement our method on Qwen2.5-32B and achieve state-of-the-art performance among models of similar scale, comparable to leading systems like DeepSeek v3.1.

🚀 The DRIVE Pipeline

Our training pipeline consists of two main phases: Supervised Fine-Tuning (SFT) and a Two-Stage Reinforcement Learning process, as illustrated below.

Figure 2: The training pipeline of our models.

Phase 1: Supervised Fine-Tuning (SFT)

We begin by fine-tuning Qwen2.5-32B. The key innovation in this stage is Difficulty-Aware Sampling:

  • We first classify all competitive programming prompts into three categories: easy, medium, and hard.
  • To force the model to focus on more challenging problems, we include each hard sample twice in the final SFT dataset (see the sketch after this list).
  • We also augment this with general-purpose coding and reasoning-intensive data to improve overall capabilities.
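
A minimal sketch of this difficulty-aware mixture, assuming each competitive-programming example carries a difficulty label; the `build_sft_mixture` helper and field names are illustrative, not the released data pipeline:

```python
import random

def build_sft_mixture(code_examples, general_examples, seed=0):
    """Difficulty-aware sampling sketch: hard competitive-programming
    samples appear twice in the SFT mixture, easy/medium samples once,
    and general coding/reasoning data is mixed in afterwards."""
    mixture = []
    for ex in code_examples:
        mixture.append(ex)
        if ex["difficulty"] == "hard":   # hard problems are duplicated
            mixture.append(ex)
    mixture.extend(general_examples)     # general-purpose coding + reasoning data
    random.Random(seed).shuffle(mixture)
    return mixture
```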

Phase 2: Two-Stage Reinforcement Learning

After SFT, the model still suffers from low entropy, repetitive generation, and poor performance on hard problems. Our two-stage RL process, driven by executable, test-case-based rewards, directly addresses these issues.
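
Below is a minimal sketch of such a verifiable reward for stdin/stdout-style problems; the `run_tests` helper, the 2-second time limit, and exact-match output comparison are illustrative assumptions rather than the exact grader used in the paper.

```python
import subprocess
import sys

def run_tests(solution_code: str, test_cases, time_limit_s: float = 2.0) -> float:
    """Execute a candidate Python solution against I/O test cases and
    return a binary, verifiable reward: 1.0 only if every case passes."""
    for case in test_cases:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0                                   # time limit exceeded
        if proc.returncode != 0:
            return 0.0                                   # runtime error
        if proc.stdout.strip() != case["expected"].strip():
            return 0.0                                   # wrong answer
    return 1.0                                           # accepted: all cases pass
```

A production grader additionally needs sandboxing and memory limits, but the reward signal stays binary and fully verifiable.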

Stage 1: Entropy Expansion

  • Goal: Increase output diversity and reduce repetitive patterns.
  • Data: A large, uniformly distributed set of ~9k problems.
  • Method: We use 8 rollouts and a shorter 24k context length. As shown in Figure 3, this "24k-style" training (blue line) successfully increases entropy, while the standard 32k-style training (orange line) leads to entropy collapse; the entropy metric itself is sketched after the figure.

Figure 3: Entropy comparison of 24k-style and 32k-style training.
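
The entropy tracked in Figure 3 is the mean per-token entropy of the policy over its own rollouts. A minimal PyTorch sketch of that quantity, with the function name and masking convention chosen for illustration:

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy of the sampling distribution.

    logits: (batch, seq_len, vocab) policy logits at generated positions
    mask:   (batch, seq_len) 1 for real rollout tokens, 0 for padding
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq_len)
    return (token_entropy * mask).sum() / mask.sum().clamp(min=1)
```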

Stage 2: Hard-Focus Curriculum

  • Goal: Master the most challenging problems.
  • Data: A small, high-quality set of difficult problems that shrinks in stages (e.g., the 72, then 50, then 32 hardest cases from LiveCode V6).
  • Method: We apply a "hard-focus curriculum" that progressively retains only the most difficult instances. Crucially, we use a large rollout budget (64-80 rollouts) in this stage, which we found essential for stable gains on hard problems (a selection sketch follows this list).
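
A minimal sketch of how such a hard-focus selection can be derived from rollout statistics, assuming each problem's empirical pass rate is estimated from the large rollout budget; the `select_hard_set` helper and its arguments are illustrative:

```python
def select_hard_set(problems, pass_counts, num_rollouts=64, keep=72):
    """Keep only the hardest problems for the next curriculum round.

    problems:     list of problem ids
    pass_counts:  dict problem_id -> rollouts that passed all test cases
    keep:         size of the retained hard set (e.g. 72, then 50, then 32)
    """
    pass_rate = {p: pass_counts.get(p, 0) / num_rollouts for p in problems}
    hardest_first = sorted(problems, key=lambda p: pass_rate[p])
    return hardest_first[:keep]
```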

📊 Key Results

Our final 32B model, DRIVE-RL, achieves state-of-the-art performance among similarly sized models and is competitive with larger 64k-context models.

Figure 1: Performance of our models on various benchmarks.

Pass@1 Performance Comparison

The two-stage RL pipeline provides significant improvements over the SFT baseline, particularly on challenging benchmarks. We see a +58.3% relative improvement on Codeforces OJ.

| Model | LiveCode 08-11 | LiveCode V5 | LiveCode V6 | LeetCode Weekly (32) | Codeforces OJ (33) |
|---|---|---|---|---|---|
| DeepseekV3.1 (64k) | 0.692 | 0.713 | 0.693 | 0.688 | 0.161 |
| Seed1.6-0715 (64k) | 0.803 | 0.824 | 0.770 | 0.743 | 0.188 |
| Qwen3-235B-2507 (64k) | 0.681 | 0.713 | 0.646 | 0.688 | 0.200 |
| SFT model (32k) | 0.602 | 0.594 | 0.549 | 0.578 | 0.115 |
| RL Stage 1 model (24k) | 0.625 | 0.627 | 0.634 | 0.603 | 0.112 |
| DRIVE-RL model (32k) | 0.699 | 0.697 | 0.703 | 0.653 | 0.182 |
| Rel. Improvement (RL vs. SFT) | +16.1% | +17.3% | +28.1% | +13.0% | +58.3% |

(Data sourced from Table 2 in our paper)
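
For reference, pass@1 from n sampled completions per problem is typically computed with the standard unbiased pass@k estimator; the sketch below is the common estimator used in code-generation evaluation, not necessarily the exact script behind Table 2:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct completions out of 16 samples -> pass@1 = 3/16
print(pass_at_k(n=16, c=3, k=1))   # 0.1875
```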

Key Findings

  1. Difficulty-aware training is crucial: Standard RL struggles with hard problems. Our hard-focus curriculum (Stage 2) is essential for pushing the model's capabilities.
  2. Entropy expansion is necessary: Skipping Stage 1 (Entropy Expansion) and training only on hard cases hurts generalization to out-of-distribution benchmarks. Both stages are necessary.
  3. Large rollouts for hard problems: A large rollout budget (e.g., 64+) is essential for mastering challenging cases.
  4. Scaling: The DRIVE strategy shows strong, positive scaling trends when applied to a large-scale internal MoE model.

📜 Citation

If you find this work useful, please cite our paper:

@misc{zhu2025drivedatacurationbest,
      title={DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation}, 
      author={Speed Zhu and Jianwei Cai and Guang Chen and Lulu Wu and Saiyong Yang and Wiggin Zhou},
      year={2025},
      eprint={2511.06307},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.06307}, 
}

License

This repository contains two separate licenses, one for each released model (tencent/DRIVE-SFT and tencent/DRIVE-RL). Please refer to the license file for the model you are using.

Model size: 33B params (Safetensors) · Tensor type: BF16

Model tree for tencent/DRIVE-RL

  • Base model: Qwen/Qwen2.5-32B
  • SFT model: tencent/DRIVE-SFT (fine-tuned from the base)
  • RL model: tencent/DRIVE-RL (this model, fine-tuned from tencent/DRIVE-SFT)
  • Quantizations: 1 model