Qwen3-Reranker-0.6B-GGUF

🚨 REQUIRED Llama.cpp build: https://github.com/ngxson/llama.cpp/tree/xsn/qwen3_embd_rerank
This unmerged fix branch is required to run Qwen3 reranking models: mainline llama.cpp does not yet support them, so other GGUF quantizations of the 0.6B reranker on Hugging Face typically fail there. The quantizations in this repository were produced with the build above and work with it.
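
For reference, a minimal sketch of building that branch, assuming a standard CMake toolchain (add back-end options such as CUDA or Metal flags as needed for your hardware):

# Clone the fix branch and build the llama.cpp binaries
git clone --branch xsn/qwen3_embd_rerank https://github.com/ngxson/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j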

Purpose

Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends.
Architecture: qwen3; parameters ≈ 0.6 B (596 M).

Note: Token embedding matrix and output tensors are left at FP16 across all quantizations.
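
A minimal usage sketch with the llama.cpp server follows; recent llama-server builds expose a reranking endpoint when started with --rerank, but the exact flags and response schema on the fix branch may differ, and the Q6_K filename is only an example:

# Serve a quantized file as a reranker (binary path assumes the CMake build shown above)
./build/bin/llama-server -m Qwen3-Reranker-0.6B-Q6_K.gguf --rerank --port 8080

# Score candidate documents against a query
curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "query": "What is the capital of France?",
  "documents": ["Paris is the capital of France.", "Berlin is the capital of Germany."]
}'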

Files

| Filename | Quant | Size (bytes / MiB) | Est. quality Δ vs FP16 |
|---|---|---|---|
| Qwen3-Reranker-0.6B-F16.gguf | FP16 | 1,197,634,048 B (1142.2 MiB) | 0 (reference) |
| Qwen3-Reranker-0.6B-Q4_K_M.gguf | Q4_K_M | 396,476,032 B (378.1 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q5_K_M.gguf | Q5_K_M | 444,186,496 B (423.6 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q6_K.gguf | Q6_K | 494,878,880 B (472.0 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q8_0.gguf | Q8_0 | 639,153,088 B (609.5 MiB) | TBD |
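
To pull a single quantization locally, something like the following should work; the repository id is an assumption taken from the model page, and the Q6_K file is only an example:

# Download one GGUF file from this repository (repo id assumed)
huggingface-cli download JonathanMiddleton/Qwen3-Reranker-0.6B \
  Qwen3-Reranker-0.6B-Q6_K.gguf --local-dir .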

Upstream Source

  • Repo: Qwen/Qwen3-Reranker-0.6B
  • Commit: f16fc5d (2025-06-09)
  • License: Apache-2.0
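
To reproduce the conversion below, the upstream checkpoint can be fetched at the pinned revision; a sketch using huggingface-cli, with the local path matching the conversion command in the next section:

# Fetch the upstream safetensors checkpoint at the pinned commit
huggingface-cli download Qwen/Qwen3-Reranker-0.6B --revision f16fc5d \
  --local-dir ~/models/local/Qwen3-Reranker-0.6B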

Conversion & Quantization

# Convert safetensors → GGUF (FP16)
python convert_hf_to_gguf.py ~/models/local/Qwen3-Reranker-0.6B \
  --outfile Qwen3-Reranker-0.6B-F16.gguf --outtype f16

# Quantize variants
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf Qwen3-Reranker-0.6B-${QT}.gguf $QT
done
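
As a sanity check that the embedding weights stayed at FP16 after quantization, the tensor list of a quantized file can be inspected; a sketch using the gguf-dump script from the gguf Python package (pip install gguf), with the Q6_K file again only as an example:

# token_embd.weight should report F16; output.weight appears only if the
# model does not tie its embeddings
gguf-dump Qwen3-Reranker-0.6B-Q6_K.gguf | grep -E "token_embd|output\.weight"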