---
license: cc-by-nc-2.0
base_model:
- Qwen/Qwen2.5-3B-Instruct
tags:
- Reasoning
- GRPO
- DeepSeek
- CoT
- finetune
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64e1a459ff3fd4fd8eedb456/9-Siwp9GLyx5F9SNzThAi.png)

A fine-tuned variant of **Qwen 2.5 3B Instruct** designed specifically for improved **toggleable reasoning** and **instruction-following capabilities**. This model has been built by engineers at [xioserv.com](https://xioserv.com) and incorporates specialized modifications to enhance performance for structured reasoning tasks.

---

## Overview

The **AaryanK/Qwen_2.5_3B_Instruct_GRPO_Reasoning_XIOSERV** model is a refined version of the Qwen 2.5 3B Instruct model. It is optimized to provide responses in a structured format, making it particularly useful for tasks requiring clear separation between reasoning steps and final answers.

### **Toggleable Reasoning Mode**

- If you include the **system prompt**, the model will **explicitly separate reasoning and the final answer**.
- If you **omit the system prompt**, the model will **respond naturally** without structured reasoning.

This makes the model highly **versatile**, allowing users to choose between structured reasoning and direct responses based on their specific use case.

---

## System Prompt

To enable structured reasoning, use the following system prompt:

```
Respond in the following format:
...
...
```

If you do not include this prompt, the model will respond in a **standard, conversational** manner without explicitly separating reasoning from the final answer.

---

## Methodology

To replicate the 'aha moment,' we employed **Group Relative Policy Optimization (GRPO)**, a variant of **Proximal Policy Optimization (PPO)**, which enhances reasoning capabilities while optimizing memory usage. This approach aligns with the techniques outlined in the **DeepSeekMath** paper, where GRPO was instrumental in advancing reasoning in language models.
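
As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes each sampled completion's reward against the mean and standard deviation of its own group, which is what lets GRPO do without a separate value model. This is not the training code used for this model: the function name `group_relative_advantages`, the reward values, the `eps` term, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rewards in its sampling group.

    Hypothetical helper: `eps` guards against division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Completions that score above their group's average get a positive advantage and are reinforced; those below get a negative one. When every completion in a group scores the same, all advantages are zero and that group contributes no learning signal.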

Trained with GRPO-based reinforcement learning, our model autonomously refines its problem-solving strategies, mirroring the **self-reflective behavior** observed in **DeepSeek's R1**.

---

## Usage

We have provided **GGUF files** that can be run with **llama.cpp** for efficient inference.

To run the model with **llama.cpp**, follow the instructions in the [llama.cpp repository](https://github.com/ggerganov/llama.cpp).

Include the system prompt in your input **if you want structured reasoning output**. Otherwise, the model will behave like a standard instruct model.

---

## Acknowledgements

- **xioserv.com** – For the engineering efforts in fine-tuning this model.
- **Hugging Face** – For providing an accessible platform to share and deploy models.

For any questions or contributions, please open an issue or submit a pull request on our [GitHub repository](https://github.com/AaryanK/Qwen_2.5_3B_Instruct_GRPO_Reasoning_XIOSERV).

---

Happy coding!