Uploaded model

  • Developed by: vishal042002
  • License: apache-2.0
  • Finetuned from model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit

This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Run the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "vishal042002/Qwen2.5-3B-GRPO",
    torch_dtype="auto",
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("vishal042002/Qwen2.5-3B-GRPO")

text = "Look at this series: 36, 34, 30, 28, 24, … What number should come next?"

# Tokenize the prompt and move it to the GPU.
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Sample a completion.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,   # enable sampling so temperature/top_p take effect
    temperature=0.7,
    top_p=0.9
)

# Decode the full sequence (prompt + completion).
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
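Because the base checkpoint is instruction-tuned, wrapping the prompt in the tokenizer's chat template usually gives cleaner answers than feeding raw text. A minimal sketch, reusing the model and tokenizer loaded above (requires a recent transformers release for return_dict support in apply_chat_template):

# Build a chat-formatted prompt with the tokenizer's built-in template.
messages = [
    {"role": "user", "content": "Look at this series: 36, 34, 30, 28, 24, … What number should come next?"}
]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
    return_dict=True,
).to("cuda")

chat_outputs = model.generate(
    **chat_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Decode only the newly generated tokens (skip the prompt portion).
print(tokenizer.decode(chat_outputs[0][chat_inputs["input_ids"].shape[-1]:], skip_special_tokens=True))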

Training Details

This model was fine-tuned using Group Relative Policy Optimization (GRPO), a reinforcement learning technique that samples a group of completions for each prompt, scores them with reward functions, and updates the policy based on each completion's reward relative to the rest of its group, without a separate value (critic) model.

Training Process

  • Base Model: Qwen2.5 3B Instruct (loaded in 4-bit via Unsloth)
  • Method: GRPO (Group Relative Policy Optimization)
  • Training Focus: maximizing reward on sampled completions while a KL penalty keeps the policy close to the reference model
  • Learning Approach: for each prompt, a group of completions is sampled and scored with reward functions; each completion is then reinforced according to how far its reward sits above or below the group average (see the sketch after this list)
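To make the "group-relative" idea concrete, the snippet below shows the advantage computation at the heart of GRPO: each completion's reward is compared against the mean and standard deviation of its own group. The reward values are made up purely for illustration.

import torch

# Rewards for one group of 4 completions sampled from the same prompt (illustrative numbers).
group_rewards = torch.tensor([0.2, 1.0, 0.5, 0.1])

# Group-relative advantages: completions above the group mean get positive advantages
# (and are reinforced), those below get negative advantages (and are discouraged).
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-4)
print(advantages)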

GRPO enhances the model's capabilities by:

  • Reinforcing completions that score above their group's average reward and discouraging those below it
  • Dropping the separate value (critic) model required by PPO, which lowers memory and compute cost
  • Penalizing divergence from the reference model, which limits reward hacking while preserving the base model's behavior (a rough TRL training sketch follows below)
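The exact training recipe for this checkpoint is not published in this card. The sketch below only illustrates the general shape of a GRPO run with TRL's GRPOTrainer; the dataset and reward function are hypothetical placeholders, not the ones actually used.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompt dataset -- swap in your own dataset with a "prompt" column.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward function: prefer completions around 100 characters long.
    return [-abs(100 - len(completion)) for completion in completions]

training_args = GRPOConfig(
    output_dir="Qwen2.5-3B-GRPO",
    num_generations=4,  # size of the sampled group per prompt
)

trainer = GRPOTrainer(
    model="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()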
