---
license: cc-by-nc-2.0
base_model:
- Qwen/Qwen2.5-3B-Instruct
tags:
- Reasoning
- GRPO
- DeepSeek
- CoT
- finetune
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64e1a459ff3fd4fd8eedb456/9-Siwp9GLyx5F9SNzThAi.png)

A fine-tuned variant of **Qwen 2.5 3B Instruct** designed specifically for improved **toggleable reasoning** and **instruction-following capabilities**. This model has been built by engineers at [xioserv.com](https://xioserv.com) and incorporates specialized modifications to enhance performance for structured reasoning tasks.

---

## Overview

The **AaryanK/Qwen_2.5_3B_Instruct_GRPO_Reasoning_XIOSERV** model is a refined version of the Qwen 2.5 3B Instruct model. It is optimized to provide responses in a structured format, making it particularly useful for tasks requiring clear separation between reasoning steps and final answers.  

### **Toggleable Reasoning Mode**
- If you include the **system prompt**, the model will **explicitly separate reasoning and the final answer**.  
- If you **omit the system prompt**, the model will **respond naturally** without structured reasoning.  

This makes the model highly **versatile**, allowing users to choose between structured reasoning and direct responses based on their specific use case.

---

## System Prompt

To enable structured reasoning, use the following system prompt:

```
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
```

If you do not include this prompt, the model will respond in a **standard, conversational** manner without explicitly separating reasoning from the final answer.

---

## Methodology

To replicate the 'aha moment,' we employed **Group Relative Policy Optimization (GRPO)**, a variant of **Proximal Policy Optimization (PPO)**, which enhances reasoning capabilities while optimizing memory usage. This approach aligns with the techniques outlined in the **DeepSeekMath** paper, where GRPO was instrumental in advancing reasoning in language models. By integrating GRPO with reinforcement learning, our model autonomously refines its problem-solving strategies, mirroring the **self-reflective behavior** observed in **DeepSeek's R1**.

---

## Usage

We have provided **GGUF files** that can be run with **llama.cpp** for efficient inference.  

To run the model with **llama.cpp**, follow the instructions in the [llama.cpp repository](https://github.com/ggerganov/llama.cpp).  

Ensure that you include the system prompt in your input **if you want structured reasoning output**. Otherwise, the model will function like a standard instruct model.

---

## Acknowledgements

- **xioserv.com** – For the engineering efforts in fine-tuning this model.
- **Hugging Face** – For providing an accessible platform to share and deploy models.

For any questions or contributions, please open an issue or submit a pull request on our [GitHub repository](https://github.com/AaryanK/Qwen_2.5_3B_Instruct_GRPO_Reasoning_XIOSERV).

---

Happy coding!