thomasjhuang
/

qwen2-rloo-countdown-step150

Text Generation

reinforcement-learning

Model card Files Files and versions Community

qwen2-rloo-countdown-step150 / README.md

thomasjhuang's picture

Add model card with training details

5473b25 verified 3 months ago

|

history blame contribute delete

2.39 kB

	---
	license: apache-2.0
	base_model: thomasjhuang/qwen2-sft-warmup
	tags:
	- reinforcement-learning
	- rloo
	- countdown-math
	- qwen2
	language:
	- en
	pipeline_tag: text-generation
	---

	# Qwen2 RLOO Countdown (Step 150)

	This model is a Qwen2-based language model fine-tuned using RLOO (Rank-order Learning with Localized Objectives) on countdown math problems.

	## Training Details

	- Base Model: thomasjhuang/qwen2-sft-warmup
	- Method: RLOO (Reinforcement Learning from Human Feedback)
	- Dataset: Jiayi-Pan/Countdown-Tasks-3to4
	- Training Steps: 150 optimizer steps
	- Learning Rate: 3e-6
	- Temperature: 0.1
	- Batch Size: 2
	- K Samples: 8

	## Key Fixes Applied

	1. Prompt Format: Updated to match SFT evaluation format with detailed instructions
	2. Token Length: Increased to 250 tokens for complete reasoning
	3. Temperature: Reduced to 0.1 for more deterministic generation
	4. Extraction: Fixed to work with vLLM outputs

	## Performance

	During training at step 150, the model achieved:
	- Average rewards ranging from 0.05 to 0.50 across batches
	- Successful generation of proper `<think>` and `<answer>` tags
	- Correct solutions to various countdown math problems

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step150")
	model = AutoModelForCausalLM.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step150")

	prompt = '''Using the numbers [8, 16, 80], create an equation that equals 72. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.'''

	inputs = tokenizer(prompt, return_tensors="pt")
	outputs = model.generate(**inputs, max_length=300, temperature=0.1)
	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(response)
	```

	## Training Progress

	This checkpoint represents an intermediate state in RLOO training where:
	- The model learned to follow the correct prompt format
	- Success rates improved from 0% to 10-50% on various problems
	- The model generates structured reasoning in `<think>` tags
	- Solutions are properly formatted in `<answer>` tags

	For the latest checkpoint, see: thomasjhuang/qwen2-rloo-countdown-final