Reasoning Work
Collection
A collection of models trained to think like DeepSeek R1 using online learning with Group Relative Policy Optimization (GRPO), the method introduced in DeepSeekMath
A Qwen2.5 3-billion-parameter model trained with GRPO to "think" like DeepSeek's R1, so that it can deduce a disease from a patient's complaints in one shot!
Tiny but really impressive model. Training it to think and reason has also resulted in a significant boost in the model's general Elo rating.
This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
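
For readers new to GRPO, here is a minimal sketch of its core idea: several completions are sampled for the same prompt, each is scored by a reward function, and each reward is normalized against the mean and standard deviation of its own group. This is not the training code used for this model. The function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero
    when every completion in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled diagnoses of one patient complaint:
# 1.0 if the predicted disease matched the reference, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Completions that beat the group average get a positive advantage,
# the others a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because the baseline comes from the group itself, no separate value model is needed, which is part of what makes this approach practical for a small 3B model.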