Qwen3-Reranker-0.6B-GGUF

🚨 REQUIRED Llama.cpp build: https://github.com/ngxson/llama.cpp/tree/xsn/qwen3_embd_rerank
This unmerged fix branch is required to run Qwen3 reranking models: mainline llama.cpp does not yet support them, so other GGUF quantizations of the 0.6B reranker on Hugging Face typically fail there. The quantizations in this repository were produced with the build above and work with it.
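
For reference, a minimal sketch of building that branch, assuming a standard CMake toolchain (add back-end options such as CUDA or Metal flags as needed for your hardware):

# Clone the fix branch and build the llama.cpp binaries
git clone --branch xsn/qwen3_embd_rerank https://github.com/ngxson/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j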

Purpose

Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends.
Architecture: qwen3; parameters ≈ 0.6 B (596 M).

Note: Token embedding matrix and output tensors are left at FP16 across all quantizations.
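
A minimal usage sketch with the llama.cpp server follows; recent llama-server builds expose a reranking endpoint when started with --rerank, but the exact flags and response schema on the fix branch may differ, and the Q6_K filename is only an example:

# Serve a quantized file as a reranker (binary path assumes the CMake build shown above)
./build/bin/llama-server -m Qwen3-Reranker-0.6B-Q6_K.gguf --rerank --port 8080

# Score candidate documents against a query
curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "query": "What is the capital of France?",
  "documents": ["Paris is the capital of France.", "Berlin is the capital of Germany."]
}'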

Files

| Filename | Quant | Size (bytes / MiB) | Est. quality Δ vs FP16 |
|---|---|---|---|
| Qwen3-Reranker-0.6B-F16.gguf | FP16 | 1,197,634,048 B (1142.2 MiB) | 0 (reference) |
| Qwen3-Reranker-0.6B-Q4_K_M.gguf | Q4_K_M | 396,476,032 B (378.1 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q5_K_M.gguf | Q5_K_M | 444,186,496 B (423.6 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q6_K.gguf | Q6_K | 494,878,880 B (472.0 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q8_0.gguf | Q8_0 | 639,153,088 B (609.5 MiB) | TBD |
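
To pull a single quantization locally, something like the following should work; the repository id is an assumption taken from the model page, and the Q6_K file is only an example:

# Download one GGUF file from this repository (repo id assumed)
huggingface-cli download JonathanMiddleton/Qwen3-Reranker-0.6B \
  Qwen3-Reranker-0.6B-Q6_K.gguf --local-dir .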

Upstream Source

  • Repo: Qwen/Qwen3-Reranker-0.6B
  • Commit: f16fc5d (2025-06-09)
  • License: Apache-2.0
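
To reproduce the conversion below, the upstream checkpoint can be fetched at the pinned revision; a sketch using huggingface-cli, with the local path matching the conversion command in the next section:

# Fetch the upstream safetensors checkpoint at the pinned commit
huggingface-cli download Qwen/Qwen3-Reranker-0.6B --revision f16fc5d \
  --local-dir ~/models/local/Qwen3-Reranker-0.6B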

Conversion & Quantization

# Convert safetensors → GGUF (FP16)
python convert_hf_to_gguf.py ~/models/local/Qwen3-Reranker-0.6B \
  --outfile Qwen3-Reranker-0.6B-F16.gguf --outtype f16

# Quantize variants
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf Qwen3-Reranker-0.6B-${QT}.gguf $QT
done
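
As a sanity check that the embedding weights stayed at FP16 after quantization, the tensor list of a quantized file can be inspected; a sketch using the gguf-dump script from the gguf Python package (pip install gguf), with the Q6_K file again only as an example:

# token_embd.weight should report F16; output.weight appears only if the
# model does not tie its embeddings
gguf-dump Qwen3-Reranker-0.6B-Q6_K.gguf | grep -E "token_embd|output\.weight"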