---
license: apache-2.0
base_model: thomasjhuang/qwen2-sft-warmup
tags:
- reinforcement-learning
- rloo
- countdown-math
- qwen2
language:
- en
pipeline_tag: text-generation
---
# Qwen2 RLOO Countdown (Step 150)
This model is a Qwen2-based language model fine-tuned with RLOO (REINFORCE Leave-One-Out) on countdown math problems.
## Training Details
- **Base Model**: thomasjhuang/qwen2-sft-warmup
- **Method**: RLOO (REINFORCE Leave-One-Out)
- **Dataset**: Jiayi-Pan/Countdown-Tasks-3to4
- **Training Steps**: 150 optimizer steps
- **Learning Rate**: 3e-6
- **Temperature**: 0.1
- **Batch Size**: 2
- **K Samples**: 8 completions per prompt, used for the leave-one-out baseline (see the sketch below)
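For reference, the leave-one-out baseline at the heart of RLOO fits in a few lines. This is a minimal sketch of the advantage computation, not code from this training run; the function and tensor names are illustrative:
```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages: each sample is baselined against
    the mean reward of the other k - 1 samples for the same prompt."""
    batch, k = rewards.shape  # rewards: one scalar per sampled completion
    # baseline_i = (sum of all rewards - r_i) / (k - 1)
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Batch size 2 with k = 8 samples per prompt, matching this run's settings.
advantages = rloo_advantages(torch.rand(2, 8))
```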
## Key Fixes Applied
1. **Prompt Format**: Updated to match SFT evaluation format with detailed instructions
2. **Token Length**: Increased the generation budget to 250 tokens to allow complete reasoning
3. **Temperature**: Reduced to 0.1 for more deterministic generation
4. **Extraction**: Fixed answer extraction to parse vLLM outputs (see the sketch below)
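Tag extraction of this kind typically reduces to a regex over the generated text, since vLLM returns completions as plain strings. A minimal sketch, with an illustrative function name (the actual training code may differ):
```python
import re

def extract_answer(completion: str) -> str | None:
    """Return the expression inside the last <answer>...</answer> pair,
    or None if the model never closed the tag."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return matches[-1].strip() if matches else None

print(extract_answer("<think>80 - 16 + 8 = 72</think> <answer>80 - 16 + 8</answer>"))
# -> "80 - 16 + 8"
```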
## Performance
During training at step 150, the model achieved:
- Average rewards ranging from 0.05 to 0.50 across batches
- Successful generation of proper `<think>` and `<answer>` tags
- Correct solutions to various countdown math problems (a verification sketch follows)
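Checking a countdown solution amounts to two tests: the expression evaluates to the target, and each given number is used exactly once. A hedged sketch of such a checker (not the reward function used in training; `is_correct` is a hypothetical helper):
```python
import re

def is_correct(expr: str, numbers: list[int], target: int) -> bool:
    """Check a countdown answer: right value, each number used exactly once."""
    # Allow only digits, basic operators, parentheses, spaces, and dots.
    if not re.fullmatch(r"[\d+\-*/() .]+", expr):
        return False
    # The numbers appearing in the expression must match the given multiset.
    if sorted(int(n) for n in re.findall(r"\d+", expr)) != sorted(numbers):
        return False
    try:
        # eval is acceptable here because the whitelist above rules out names.
        return abs(eval(expr) - target) < 1e-6
    except (SyntaxError, ZeroDivisionError):
        return False

print(is_correct("80 - 8", [8, 16, 80], 72))       # False: 16 is unused
print(is_correct("80 - 16 + 8", [8, 16, 80], 72))  # True
```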
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step150")
model = AutoModelForCausalLM.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step150")
prompt = '''Using the numbers [8, 16, 80], create an equation that equals 72. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.'''
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is needed for the temperature setting to take effect;
# max_new_tokens=250 matches the generation budget described above.
outputs = model.generate(**inputs, max_new_tokens=250, do_sample=True, temperature=0.1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## Training Progress
This checkpoint represents an intermediate state in RLOO training where:
- The model learned to follow the correct prompt format
- Success rates improved from 0% to 10-50% on various problems
- The model generates structured reasoning in `<think>` tags
- Solutions are properly formatted in `<answer>` tags
For the latest checkpoint, see: thomasjhuang/qwen2-rloo-countdown-final