---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
license: apache-2.0
language:
- ja
tags:
- qwen2
- conversational
- gptq
base_model: rinna/qwen2.5-bakeneko-32b-instruct
base_model_relation: quantized
pipeline_tag: text-generation
library_name: transformers
---

# `Qwen2.5 Bakeneko 32B Instruct GPTQ int8 (rinna/qwen2.5-bakeneko-32b-instruct-gptq-int8)`

![rinna-icon](./rinna.png)

# Overview

This model is an 8-bit quantized version of [rinna/qwen2.5-bakeneko-32b-instruct](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct), created with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ). The quantized model is about half the size of the original, so it requires less memory and provides faster inference.
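
For reference, here is a minimal sketch of how such an int8 GPTQ export can be produced with AutoGPTQ. The calibration text and quantization settings below are illustrative assumptions, not the exact recipe used for this release.

~~~python
# Sketch: int8 GPTQ quantization with AutoGPTQ.
# Calibration data and settings are illustrative assumptions only.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base_id = "rinna/qwen2.5-bakeneko-32b-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)

quantize_config = BaseQuantizeConfig(
    bits=8,          # int8 weight quantization
    group_size=128,  # per-group granularity (assumed value)
)

# A few tokenized calibration examples; a real run uses a larger corpus.
examples = [
    tokenizer("吾輩は猫である。名前はまだ無い。", return_tensors="pt"),
]

model = AutoGPTQForCausalLM.from_pretrained(base_id, quantize_config)
model.quantize(examples)
model.save_quantized("qwen2.5-bakeneko-32b-instruct-gptq-int8")
~~~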

| Model Type | Model Name
| :-   | :-
| Japanese Continual Pre-Training Model | Qwen2.5 Bakeneko 32B [[HF]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b)
| Instruction-Tuning Model | Qwen2.5 Bakeneko 32B Instruct [[HF]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct)[[AWQ]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-awq)[[GGUF]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-gguf)[[GPTQ int8]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-gptq-int8)[[GPTQ int4]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-gptq-int4)
| DeepSeek R1 Distill Qwen2.5 Merged Reasoning Model | DeepSeek R1 Distill Qwen2.5 Bakeneko 32B [[HF]](https://huggingface.co/rinna/deepseek-r1-distill-qwen2.5-bakeneko-32b)[[AWQ]](https://huggingface.co/rinna/deepseek-r1-distill-qwen2.5-bakeneko-32b-awq)[[GGUF]](https://huggingface.co/rinna/deepseek-r1-distill-qwen2.5-bakeneko-32b-gguf)[[GPTQ int8]](https://huggingface.co/rinna/deepseek-r1-distill-qwen2.5-bakeneko-32b-gptq-int8)[[GPTQ int4]](https://huggingface.co/rinna/deepseek-r1-distill-qwen2.5-bakeneko-32b-gptq-int4)
| QwQ Merged Reasoning Model | QwQ Bakeneko 32B [[HF]](https://huggingface.co/rinna/qwq-bakeneko-32b)[[AWQ]](https://huggingface.co/rinna/qwq-bakeneko-32b-awq)[[GGUF]](https://huggingface.co/rinna/qwq-bakeneko-32b-gguf)[[GPTQ int8]](https://huggingface.co/rinna/qwq-bakeneko-32b-gptq-int8)[[GPTQ int4]](https://huggingface.co/rinna/qwq-bakeneko-32b-gptq-int4)
| QwQ Bakeneko Merged Instruction-Tuning Model | Qwen2.5 Bakeneko 32B Instruct V2 [[HF]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-v2)[[AWQ]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-v2-awq)[[GGUF]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-v2-gguf)[[GPTQ int8]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-v2-gptq-int8)[[GPTQ int4]](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-v2-gptq-int4)

See [rinna/qwen2.5-bakeneko-32b-instruct](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct) for details on the model architecture and training data.

* **Contributors**
    - [Toshiaki Wakatsuki](https://huggingface.co/t-w)
    - [Xinqi Chen](https://huggingface.co/Keely0419)
    - [Kei Sawada](https://huggingface.co/keisawada)

* **Release date**

    February 13, 2025

---

# Benchmarking

| Model | Japanese LM Evaluation Harness | Japanese MT-Bench (first turn) | Japanese MT-Bench (multi turn)
| :-    | :-: | :-: | :-:
| [Qwen/Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | 79.46 | - | -
| [rinna/qwen2.5-bakeneko-32b](https://huggingface.co/rinna/qwen2.5-bakeneko-32b) | 79.18 | - | -
| [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | 78.29 | 8.13 | 7.54 
| [rinna/qwen2.5-bakeneko-32b-instruct](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct) | 79.62 | 8.17 | 7.66
| [rinna/qwen2.5-bakeneko-32b-instruct-v2](https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-v2) | 77.92 | 8.86 | 8.53
| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | 73.51 | 7.39 | 6.88
| [rinna/deepseek-r1-distill-qwen2.5-bakeneko-32b](https://huggingface.co/rinna/deepseek-r1-distill-qwen2.5-bakeneko-32b) | 77.43 | 8.58 | 8.19
| [Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | 76.12 | 8.58 | 8.25
| [rinna/qwq-bakeneko-32b](https://huggingface.co/rinna/qwq-bakeneko-32b) | 78.31 | 8.81 | 8.52

For detailed benchmarking results, please refer to [rinna's LM benchmark page (Sheet 20250213)](https://rinnakk.github.io/research/benchmarks/lm/index.html).

---

# How to use the model

~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "rinna/qwen2.5-bakeneko-32b-instruct-gptq-int8"

# Load the tokenizer and the GPTQ-quantized model; the quantization config
# is read from the checkpoint, so no GPTQ-specific arguments are needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build a chat prompt with the Qwen2.5 chat template.
messages = [
    # "You are a sincere and excellent Japanese assistant."
    {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。"},
    # "Describe the characteristics of the item boxes that appear in games,
    #  novels, and anime, and speculate in detail on how they might work."
    {"role": "user", "content": "ゲーム・小説・アニメに登場するアイテムボックスの特徴と、その原理を詳細に推測してください。"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# The chat template already inserts the special tokens.
input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.05,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
~~~
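
The sampling settings above (temperature 0.7, top-p 0.8, top-k 20, repetition penalty 1.05) follow the generation defaults recommended for Qwen2.5 instruct models; adjust them to your use case.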

---

# Tokenization
The model uses the original [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) tokenizer.
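
As a quick sanity check, assuming only the standard `transformers` API, you can confirm that the chat template and special tokens are inherited from the Qwen2.5 tokenizer:

~~~python
from transformers import AutoTokenizer

# The checkpoint ships the same tokenizer files as Qwen/Qwen2.5-32B-Instruct,
# so the rendered prompt should use the Qwen <|im_start|>/<|im_end|> format.
tokenizer = AutoTokenizer.from_pretrained("rinna/qwen2.5-bakeneko-32b-instruct-gptq-int8")
messages = [{"role": "user", "content": "こんにちは"}]  # "Hello"
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
~~~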

---

# How to cite
```bibtex
@misc{rinna-qwen2.5-bakeneko-32b-instruct-gptq-int8,
    title = {rinna/qwen2.5-bakeneko-32b-instruct-gptq-int8},
    author = {Wakatsuki, Toshiaki and Chen, Xinqi and Sawada, Kei},
    url = {https://huggingface.co/rinna/qwen2.5-bakeneko-32b-instruct-gptq-int8}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```
---

# References
```bibtex
@misc{qwen2.5,
    title = {Qwen2.5: A Party of Foundation Models},
    url = {https://qwenlm.github.io/blog/qwen2.5/},
    author = {Qwen Team},
    month = {September},
    year = {2024}
}

@article{qwen2,
    title = {Qwen2 Technical Report}, 
    author = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
    journal = {arXiv preprint arXiv:2407.10671},
    year = {2024}
}

@article{huang2023chat,
    title = {Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages},
    author = {Huang, Shih-Cheng and Li, Pin-Zu and Hsu, Yu-Chi and Chen, Kuang-Ming and Lin, Yu Tung and Hsiao, Shih-Kai and Tzong-Han Tsai, Richard and Lee, Hung-yi},
    year = {2023},
    url = {https://arxiv.org/abs/2310.04799}
}

@article{meng2024simpo,
    title = {SimPO: Simple Preference Optimization with a Reference-Free Reward},
    author = {Meng, Yu and Xia, Mengzhou and Chen, Danqi},
    journal = {arXiv preprint arXiv:2405.14734},
    year = {2024}
}

@article{frantar2022gptq,
    title = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
    author = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
    year = {2022},
    url = {https://arxiv.org/abs/2210.17323}
}
```
---

# License
[The Apache License, Version 2.0](https://opensource.org/license/apache-2-0)