# Uploaded model
- Developed by: vishal042002
- License: apache-2.0
- Finetuned from model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit

This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
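For context, the following is a minimal sketch of how the 4-bit base checkpoint can be loaded with Unsloth before fine-tuning; the `max_seq_length` value is an assumption, not the actual training setting.

```python
# Illustrative only: loading the 4-bit base model with Unsloth.
from unsloth import FastLanguageModel

base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    max_seq_length=2048,   # assumed; not the actual training value
    load_in_4bit=True,     # matches the bnb-4bit checkpoint
)
```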
## Run the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "vishal042002/Qwen2.5-3B-GRPO",
    torch_dtype="auto",
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("vishal042002/Qwen2.5-3B-GRPO")

# Example reasoning prompt.
text = "Look at this series: 36, 34, 30, 28, 24, … What number should come next?"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Sample a completion; do_sample=True is required for temperature/top_p to take effect.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
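Because the base model is instruction-tuned, prompts are usually wrapped in its chat template. Here is a minimal sketch using the standard `transformers` chat-template API, reusing `model`, `tokenizer`, and `text` from the snippet above (the plain-text prompt above also works):

```python
# Optional: format the prompt with the model's chat template before generating.
messages = [{"role": "user", "content": text}]
chat_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant-turn marker
    return_tensors="pt",
).to("cuda")

chat_outputs = model.generate(
    chat_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(chat_outputs[0][chat_ids.shape[-1]:], skip_special_tokens=True))
```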
## Training Details

This model was fine-tuned using Group Relative Policy Optimization (GRPO), a reinforcement learning technique that optimizes the policy directly against reward signals, using group-relative advantage estimates in place of a separate value model.

### Training Process
- Base Model: Qwen2.5 3B Instruct (`unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit`)
- Method: GRPO (Group Relative Policy Optimization)
- Training Focus: The model was trained to maximize reward signals while a KL penalty keeps its behavior close to the base model
- Learning Approach: For each prompt the model samples a group of candidate responses; each response is scored by reward functions, and advantages are computed relative to the group average, steering generation toward higher-reward answers
GRPO enhances the model's capabilities by:
- Optimizing directly against reward signals without training a separate value/critic model
- Using group-relative advantages, which keeps RL fine-tuning stable and memory-efficient at the 3B scale
- Constraining updates with a KL penalty against the reference model, reducing the potential for reward hacking while preserving desired model behaviors

A minimal training sketch with TRL is shown below.
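The sketch uses Hugging Face TRL's `GRPOTrainer`. The dataset, reward function, and hyperparameters below are assumptions for demonstration, not the configuration actually used to train this model.

```python
# Hedged sketch of GRPO fine-tuning with TRL; all choices here are illustrative.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def conciseness_reward(completions, **kwargs):
    # Hypothetical reward: prefer answers near ~200 characters
    # (a stand-in for the real task-specific reward).
    return [-abs(len(c) - 200) / 200.0 for c in completions]

# Example dataset with a "prompt" column (assumed, not the real training data).
train_dataset = load_dataset("trl-lib/tldr", split="train")

config = GRPOConfig(
    output_dir="qwen2.5-3b-grpo",
    num_generations=8,              # size of the sampled group per prompt
    max_completion_length=128,
    learning_rate=5e-6,
    per_device_train_batch_size=8,  # global batch must be divisible by num_generations
)

trainer = GRPOTrainer(
    model="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    reward_funcs=conciseness_reward,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```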