---
base_model: perplexity-ai/r1-1776
language:
- en
library_name: transformers
license: mit
tags:
- deepseek
- deepseek_v3
- unsloth
- transformers
---
<div>
  <p style="margin-bottom: 0; margin-top: 0;">
    <strong>See <a href="https://huggingface.co/collections/unsloth/deepseek-r1-all-versions-678e1c48f5d2fce87892ace5">our collection</a> for versions of DeepSeek-R1 including GGUF & 4-bit formats.</strong>
  </p>
  <p style="margin-bottom: 0;">
    <em>Unsloth's r1-1776 <a href="https://unsloth.ai/blog/deepseekr1-dynamic">2-bit Dynamic Quants</a> are selectively quantized, greatly improving accuracy over standard 1-bit and 2-bit quantization.</em>
  </p>
  <div style="display: flex; gap: 5px; align-items: center; ">
    <a href="https://github.com/unslothai/unsloth/">
      <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
    </a>
    <a href="https://discord.gg/unsloth">
      <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
    </a>
    <a href="https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-r1-on-your-own-local-device">
      <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
    </a>
  </div>
<h1 style="margin-top: 0rem;">Instructions to run this model in llama.cpp:</h1>
</div>

You can also view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic)
1. Do not forget the `<|User|>` and `<|Assistant|>` tokens (or use a chat template formatter), and do not forget the trailing `<think>\n`!
Prompt format: `"<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"` (a small Python sketch for building this string is shown below).
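A minimal sketch of building this prompt string in Python (the helper name `build_r1_prompt` is just for illustration, not part of the original instructions):
```python
def build_r1_prompt(user_message: str) -> str:
    # r1-1776 / DeepSeek-R1 chat format: user turn, assistant turn, then an
    # opening <think> tag followed by a newline so the model starts its reasoning trace.
    return f"<|User|>{user_message}<|Assistant|><think>\n"

prompt = build_r1_prompt("Create a Flappy Bird game in Python.")
# -> "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```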
2. Obtain the latest `llama.cpp` from https://github.com/ggerganov/llama.cpp, or follow the build instructions below:
```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
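# Note (assumption): if you have no NVIDIA GPU, change -DGGML_CUDA=ON to OFF below for a CPU-only build.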
cmake llama.cpp -B llama.cpp/build \
	-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
3. It's best to use `--min-p 0.05` to counteract very rare token predictions; we found this works well, especially for the 1.58-bit model.
4. Download the model via:
```python
# pip install huggingface_hub hf_transfer
# Optional (faster downloads): uncomment the next two lines before importing huggingface_hub
# import os
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
  repo_id = "unsloth/r1-1776-GGUF",
  local_dir = "r1-1776-GGUF",
  allow_patterns = ["*UD-Q2_K_XL*"], # Select quant type Q2_K_XL for dynamic 2bit
)
```
5. Run the model. Example with the K cache quantized to Q4_0 (**note: `-no-cnv` disables auto conversation mode**):
```bash
   ./llama.cpp/llama-cli \
	  --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
	  --cache-type-k q4_0 \
	  --threads 12 -no-cnv --prio 2 \
	  --temp 0.6 \
	  --ctx-size 8192 \
	  --seed 3407 \
	  --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```
   Example output (this sample is from a simpler arithmetic prompt rather than the Flappy Bird prompt above):
   
   ```txt
    Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
    Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
    Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
    I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
    Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
   ```
   
6. If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload several layers to the GPU for faster processing; with multiple GPUs you can offload more. A rough sizing sketch follows the command below.
```bash
  ./llama.cpp/llama-cli \
    --model r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --n-gpu-layers 7 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```
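As a rough back-of-envelope only (an assumption on our part: it treats the weights as spread evenly across DeepSeek-R1's 61 transformer blocks and ignores context length), you can estimate a starting value for `--n-gpu-layers` from the on-disk size in the quant table further down:
```python
# Very rough starting point for --n-gpu-layers (assumption: weights are spread
# roughly evenly across the model's transformer blocks).
model_size_gb = 211    # UD-Q2_K_XL size on disk (see the quant table below)
n_blocks = 61          # DeepSeek-R1 / r1-1776 transformer block count
vram_gb = 24           # e.g. a single RTX 4090
overhead_gb = 2        # rough allowance for KV cache and compute buffers

gb_per_block = model_size_gb / n_blocks                     # ~3.5 GB per block
n_gpu_layers = int((vram_gb - overhead_gb) / gb_per_block)
print(n_gpu_layers)    # ~6, close to the --n-gpu-layers 7 used above
```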
7. If you want to merge the split GGUF files into a single file, use:
```bash
./llama.cpp/llama-gguf-split --merge \
    r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
    merged_file.gguf
```

| Dynamic Bits     | Type   | Disk Size |  Accuracy | Link                      | Details   |
| -------- | -------- | ------------ | ------------ | ---------------------|  ---------- |
| 2bit | UD-Q2_K_XL |   **211GB**    | Better      | [Link](https://huggingface.co/unsloth/r1-1776-GGUF/tree/main/r1-1776-UD-Q2_K_XL) | All MoE layers at 2.5-bit; MoE `down_proj` uses a 3.5/2.5-bit mixture |
| 3bit | UD-Q3_K_XL |   **298GB**    | Best        | [Link](https://huggingface.co/unsloth/r1-1776-GGUF/tree/main/r1-1776-UD-Q3_K_XL) | MoE layers at Q3_K_M; attention layers are upcast to higher precision |
| 4bit | UD-Q4_K_XL |   **377GB**    | Best        | [Link](https://huggingface.co/unsloth/r1-1776-GGUF/tree/main/r1-1776-UD-Q4_K_XL) | MoE layers at Q4_K_M; attention layers are upcast to higher precision |
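
To grab one of the other quants in this table, change the `allow_patterns` filter in the `snapshot_download` call from step 4, for example for the 3-bit version:
```python
from huggingface_hub import snapshot_download

snapshot_download(
  repo_id = "unsloth/r1-1776-GGUF",
  local_dir = "r1-1776-GGUF",
  allow_patterns = ["*UD-Q3_K_XL*"],  # 3-bit dynamic quant from the table above
)
```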

# Finetune your own Reasoning model like R1 with Unsloth!
We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/unsloth)
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)


## ✨ Finetune for Free

All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster fine-tuned model that can be exported to GGUF, served with vLLM, or uploaded to Hugging Face.

| Unsloth supports          |    Free Notebooks                                                                                           | Performance | Memory use |
|-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
| **GRPO with Phi-4 (14B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4_(14B)-GRPO.ipynb)               | 2x faster | 80% less |
| **Llama-3.2 (3B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)               | 2.4x faster | 58% less |
| **Llama-3.2 (11B vision)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)               | 2x faster | 60% less |
| **Qwen2 VL (7B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb)               | 1.8x faster | 60% less |
| **Qwen2.5 (7B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb)               | 2x faster | 60% less |
| **Llama-3.1 (8B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb)               | 2.4x faster | 58% less |
| **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb)               | 2x faster | 50% less |
| **Gemma 2 (9B)**      | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb)               | 2.4x faster | 58% less |
| **Mistral (7B)**    | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb)               | 2.2x faster | 62% less |

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)

- This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
- This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
- \* Kaggle has 2x T4s, but we use only 1; due to multi-GPU overhead, 1x T4 is 5x faster.

# R1 1776

Blog link: [https://perplexity.ai/hub/blog/open-sourcing-r1-1776](https://perplexity.ai/hub/blog/open-sourcing-r1-1776)

R1 1776 is a DeepSeek-R1 reasoning model that has been post-trained by Perplexity AI to remove Chinese Communist Party censorship. 
The model provides unbiased, accurate, and factual information while maintaining high reasoning capabilities.

## Evals

To ensure our model remains fully “uncensored” and capable of engaging with a broad spectrum of sensitive topics, we curated a diverse, multilingual evaluation set of over 1,000 examples that comprehensively covers such subjects. We then used human annotators as well as carefully designed LLM judges to measure the likelihood that a model will evade or provide overly sanitized responses to these queries.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/675c8332d01f593dc90817f5/GiN2VqC5hawUgAGJ6oHla.png)
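
The exact judging setup is not published in this card; as a rough illustration only, here is a minimal sketch of what an LLM-judge evasion check could look like. The `judge` callable, the prompt wording, and the scoring are all hypothetical and are not Perplexity's actual pipeline.
```python
# Hypothetical sketch of an LLM-judge evasion check (NOT the actual eval pipeline).
# `judge` stands in for any instruction-following LLM you can call with a prompt string.
from typing import Callable, List, Tuple

JUDGE_PROMPT = (
    "You are grading a model response for evasiveness.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Answer with a single word: EVASIVE if the response refuses, deflects, or is "
    "overly sanitized; DIRECT otherwise."
)

def evasion_rate(pairs: List[Tuple[str, str]], judge: Callable[[str], str]) -> float:
    """Fraction of (question, response) pairs the judge labels as evasive."""
    evasive = 0
    for question, response in pairs:
        verdict = judge(JUDGE_PROMPT.format(question=question, response=response))
        if "EVASIVE" in verdict.upper():
            evasive += 1
    return evasive / max(len(pairs), 1)
```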

We also ensured that the model’s math and reasoning abilities remained intact after the decensoring process. Evaluations on multiple benchmarks showed that our post-trained model performed on par with the base R1 model, indicating that the decensoring had no impact on its core reasoning capabilities.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/675c8332d01f593dc90817f5/n4Z9Byqp2S7sKUvCvI40R.png)