# Qwen3-Reranker-0.6B-GGUF
🚨 **REQUIRED llama.cpp build:** https://github.com/ngxson/llama.cpp/tree/xsn/qwen3_embd_rerank

This unmerged fix branch is mandatory for running Qwen3 reranking models. Other GGUF quantizations of the 0.6B reranker on Hugging Face typically fail in mainline llama.cpp because they were not produced with this build. The quantizations in this repository were produced with the build above and work.
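For reference, a minimal sketch of building that branch from source, assuming a standard CMake toolchain (add CUDA/Metal options as appropriate for your hardware):

```bash
# Clone the fork and switch to the required fix branch
git clone https://github.com/ngxson/llama.cpp.git
cd llama.cpp
git checkout xsn/qwen3_embd_rerank

# Standard CMake release build; resulting binaries (llama-server,
# llama-quantize, ...) land in build/bin/
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```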
## Purpose

Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends. Approximately 0.6 B parameters.

Note: the token embedding matrix and output tensor are left at FP16 across all quantizations.
## Files

| Filename | Quant | Size (bytes / MiB) | Est. quality Δ vs FP16 |
|---|---|---|---|
| Qwen3-Reranker-0.6B-F16.gguf | FP16 | 1,197,634,048 B (1142.2 MiB) | 0 (reference) |
| Qwen3-Reranker-0.6B-Q4_K_M.gguf | Q4_K_M | 396,476,032 B (378.1 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q5_K_M.gguf | Q5_K_M | 444,186,496 B (423.6 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q6_K.gguf | Q6_K | 494,878,880 B (472.0 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q8_0.gguf | Q8_0 | 639,153,088 B (609.5 MiB) | TBD |
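A minimal serving sketch, assuming the `llama-server` binary built from the branch above and the `--reranking` flag / `/v1/rerank` endpoint as found in mainline llama.cpp (flag names and request schema may differ on the fix branch):

```bash
# Serve the Q4_K_M file in reranking mode (port 8080 is an arbitrary choice)
./build/bin/llama-server -m Qwen3-Reranker-0.6B-Q4_K_M.gguf --reranking --port 8080

# Score candidate documents against a query; higher scores mean more relevant
curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital and largest city of France.",
    "The Great Wall of China is thousands of kilometres long."
  ]
}'
```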
## Upstream Source

- Repo: Qwen/Qwen3-Reranker-0.6B
- Commit: f16fc5d (2025-06-09)
- License: Apache-2.0
## Conversion & Quantization

```bash
# Convert safetensors → GGUF (FP16)
python convert_hf_to_gguf.py ~/models/local/Qwen3-Reranker-0.6B

# Quantize variants, keeping token embeddings and the output tensor at FP16
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf Qwen3-Reranker-0.6B-${QT}.gguf $QT
done
```
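As an optional sanity check, the tensor types of a quantized file can be inspected to confirm that the token embedding and output tensors stayed at F16. This assumes the `gguf-dump` utility from the `gguf` Python package (`pip install gguf`); the script name and location vary between versions:

```bash
# List tensor metadata and keep only the embedding / output tensors,
# which should be reported as F16
gguf-dump Qwen3-Reranker-0.6B-Q4_K_M.gguf | grep -E "token_embd|output"
```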