# NVIDIA-Nemotron-Nano-9B-v2-gguf
GGUF quantizations of NVIDIA’s NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.
## Available Models

| Model | Size | Bits/Weight | Description |
|---|---|---|---|
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf | 8.9GB | ~8.0 | Near-lossless, reference quality |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf | 8.6GB | ~6.0 | High quality, recommended for most users |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf | 6.6GB | ~5.0 | Good quality, balanced |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf | 6.1GB | ~4.0 | Standard choice, good compression |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf | 5.8GB | ~4.0 | 4-bit K (small), smaller than Q4_K_M |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf | 5.5GB | ~4.0 | Legacy 4-bit (Q4_1), better than Q4_0 |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf | 5.0GB | ~4.0 | Legacy 4-bit (Q4_0), smaller |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf | 5.0GB | 4.25 | i-quant, excellent compression |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf | 4.9GB | 3.66 | Ultra-small, mobile/edge deployment |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf | 4.7GB | ~2.0 | 2-bit K, maximum compression |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf | 17GB | 16.0 | Full-precision reference (optional) |
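File sizes run somewhat higher than the nominal bits/weight implies because K-quant mixes keep some tensors (e.g., embeddings and the output head) at higher precision. A minimal sketch of the effective bits/weight calculation, assuming roughly 8.9B parameters for the base model (that count is an assumption here, not taken from this repo):

```bash
# Effective bits/weight = file size in bits / parameter count.
# 8.9e9 params is an assumed figure; 6.1e9 bytes is the Q4_K_M size above.
awk 'BEGIN { printf "Q4_K_M: %.2f effective bpw\n", 6.1e9 * 8 / 8.9e9 }'
```

As a sanity check, the same arithmetic on the Q8_0 file (8.9GB) gives ~8.0 bpw, matching the table.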
## Usage

- Download a quantization:

```bash
huggingface-cli download weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf --local-dir ./
```

- Run with llama.cpp:

```bash
./llama-server -m NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf -c 4096
```
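Once llama-server is running it exposes an OpenAI-compatible HTTP API (default port 8080), so a quick smoke test can be done with curl; the prompt and sampling values below are arbitrary examples:

```bash
# Query the llama-server OpenAI-compatible chat endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Give me one fact about GGUF."}],
    "temperature": 0.7,
    "max_tokens": 64
  }'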
## Performance (tokens/s)

CPU vs. CUDA vs. CUDA+FlashAttn on a 24GB RTX 3090, with n_predict=64, temp=0.7, top_p=0.95.

| Model | CPU Factoid | CPU Code | CPU Reasoning | CUDA Factoid | CUDA Code | CUDA Reasoning | CUDA+FA Factoid | CUDA+FA Code | CUDA+FA Reasoning |
|---|---|---|---|---|---|---|---|---|---|
| IQ3_M | 10.96 | 9.83 | 9.84 | 59.51 | 48.83 | 51.22 | 49.46 | 51.48 | 51.54 |
| Q4_K_M | 8.59 | 8.03 | 8.02 | 48.28 | 48.72 | 48.70 | 53.48 | 48.73 | 47.97 |
| Q5_K_M | 7.54 | 7.54 | 7.52 | 49.09 | 46.00 | 46.87 | 51.25 | 50.58 | 47.00 |
| Q6_K | 6.65 | 6.19 | 5.89 | 52.77 | 41.84 | 42.06 | 47.59 | 41.48 | 42.85 |
| Q8_0 | 6.95 | 5.79 | 5.93 | 45.99 | 40.81 | 41.51 | 48.32 | 41.21 | 41.54 |
Notes:
- IQ3_M is the fastest on this setup; Q4_K_M offers stronger quality at close to the same speed.
- Flash Attention helps variably; larger micro-batches (e.g., `--ubatch-size 1024`) can improve throughput, as in the sketch below.
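A hedged reproduction sketch using llama-bench, llama.cpp's benchmarking tool: the prompt/generation lengths here are assumptions, not the exact harness used for the table above, and flag spellings can vary across llama.cpp versions.

```bash
# Sweep Flash Attention on/off and two micro-batch sizes for one quant.
# -fa and -ub accept comma-separated value lists in recent llama.cpp builds.
./llama-bench -m NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf \
  -p 512 -n 64 \
  -fa 0,1 \
  -ub 512,1024
```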
## Notes

- Base model: nvidia/NVIDIA-Nemotron-Nano-9B-v2
- These are GGUF files suitable for llama.cpp and compatible backends; one example backend is sketched below.
- Choose a quantization based on your resource/quality needs (see the table above).
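As one sketch of a compatible backend, the GGUF can be imported into Ollama via a minimal Modelfile, provided the bundled llama.cpp build supports this model architecture; the model name `nemotron-nano-9b` is arbitrary:

```bash
# Import the downloaded GGUF into Ollama (assumes the file is in the current directory).
cat > Modelfile <<'EOF'
FROM ./NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf
EOF
ollama create nemotron-nano-9b -f Modelfile
ollama run nemotron-nano-9b "Hello"
```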
## License

- NVIDIA Open Model License: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/