NVIDIA-Nemotron-Nano-9B-v2-gguf

GGUF quantizations of NVIDIA’s NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.

Available Models

Model                                          Size     Bits/Weight   Description
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf      8.9 GB   ~8.0          Near-lossless, reference quality
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf      8.6 GB   ~6.0          High quality, recommended for most users
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf    6.6 GB   ~5.0          Good quality, balanced
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf    6.1 GB   ~4.0          Standard choice, good compression
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf    5.8 GB   ~4.0          4-bit K (small), smaller than Q4_K_M
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf      5.5 GB   ~4.0          Legacy 4-bit (Q4_1), better than Q4_0
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf      5.0 GB   ~4.0          Legacy 4-bit (Q4_0), smaller
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf    5.0 GB   4.25          i-quant, excellent compression
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf     4.9 GB   3.66          Ultra-small i-quant, mobile/edge deployment
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf      4.7 GB   ~2.0          2-bit K, maximum compression
NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf       17 GB    16.0          Full-precision reference (optional)
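
A GGUF file's size tracks parameter count times bits per weight, so you can sanity-check the table or estimate an untested quantization before downloading. A minimal sketch (hypothetical helper, using this model's 8.89B parameter count); note that K- and I-quants keep some tensors at higher precision, so real files run larger than the naive estimate:

  # Naive GGUF size estimate: parameters * bits-per-weight / 8 bytes.
  # Ignores metadata and the higher-precision tensors that mixed quants
  # keep (e.g., embeddings/output), so treat it as a lower bound.
  def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
      return n_params * bits_per_weight / 8 / 1e9

  # Sanity check against the table above (8.89B parameters):
  for name, bpw in [("f16", 16.0), ("Q8_0", 8.0), ("IQ4_XS", 4.25)]:
      print(f"{name}: ~{estimate_gguf_size_gb(8.89e9, bpw):.1f} GB")
  # f16 -> ~17.8 GB, Q8_0 -> ~8.9 GB, IQ4_XS -> ~4.7 GB (actual file: 5.0 GB)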

Usage

  • Download a quantization:
    huggingface-cli download weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf --local-dir ./
  • Run it with llama.cpp (here with a 4096-token context); a sketch for querying the running server follows below:
    ./llama-server -m NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf -c 4096
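
Once llama-server is up, it exposes an OpenAI-compatible chat endpoint. A minimal stdlib-only client, assuming the default host and port (http://localhost:8080; adjust if you pass --host/--port):

  # Query llama-server's OpenAI-compatible /v1/chat/completions endpoint.
  # Assumes the server started above is listening on http://localhost:8080.
  import json
  import urllib.request

  payload = {
      "messages": [{"role": "user", "content": "Summarize GGUF in one sentence."}],
      "temperature": 0.7,
      "max_tokens": 64,
  }
  req = urllib.request.Request(
      "http://localhost:8080/v1/chat/completions",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      body = json.load(resp)
  print(body["choices"][0]["message"]["content"])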

Performance (tokens/s)

Decode throughput for CPU, CUDA, and CUDA + Flash Attention backends on a 24 GB RTX 3090, across factoid, code, and reasoning prompts (n_predict=64, temp=0.7, top_p=0.95).

          CPU                       CUDA                      CUDA+FA
Model     Factoid  Code   Reas.     Factoid  Code   Reas.     Factoid  Code   Reas.
IQ3_M     10.96    9.83   9.84      59.51    48.83  51.22     49.46    51.48  51.54
Q4_K_M    8.59     8.03   8.02      48.28    48.72  48.70     53.48    48.73  47.97
Q5_K_M    7.54     7.54   7.52      49.09    46.00  46.87     51.25    50.58  47.00
Q6_K      6.65     6.19   5.89      52.77    41.84  42.06     47.59    41.48  42.85
Q8_0      6.95     5.79   5.93      45.99    40.81  41.51     48.32    41.21  41.54

Notes:

  • IQ3_M is the fastest quantization on this setup; Q4_K_M is nearly as fast with stronger quality.
  • Flash Attention helps inconsistently at these sizes; larger micro-batches (e.g., --ubatch-size 1024) can improve throughput. A sketch for measuring throughput on your own hardware follows below.
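
These numbers are specific to one machine, so it is worth re-measuring locally. One rough approach (a hypothetical harness, not what produced the table above) times a completion through the running llama-server and divides generated tokens by wall-clock time; the elapsed time includes prompt processing, so it slightly underestimates pure decode speed:

  # Rough tokens/s probe against a running llama-server (default port 8080).
  # Hypothetical harness; not the script used for the table above.
  import json
  import time
  import urllib.request

  def measure_tps(prompt: str, max_tokens: int = 64) -> float:
      payload = {
          "messages": [{"role": "user", "content": prompt}],
          "max_tokens": max_tokens,
          "temperature": 0.7,
          "top_p": 0.95,
      }
      req = urllib.request.Request(
          "http://localhost:8080/v1/chat/completions",
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      start = time.perf_counter()
      with urllib.request.urlopen(req) as resp:
          body = json.load(resp)
      elapsed = time.perf_counter() - start
      # "usage.completion_tokens" follows the OpenAI response schema.
      return body["usage"]["completion_tokens"] / elapsed

  print(f"{measure_tps('What is the capital of France?'):.2f} tok/s")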

Notes

  • Base model: nvidia/NVIDIA-Nemotron-Nano-9B-v2 (nemotron_h architecture, 8.89B parameters)
  • These are GGUF files suitable for llama.cpp and compatible backends.
  • Choose a quantization based on your memory budget and quality needs (see the table above); a toy selection helper follows below.
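
If you want to script that choice, a toy helper (hypothetical, with sizes hard-coded from the table above) can pick the largest file that fits a given memory budget:

  # Toy quantization picker: largest file from the table above that fits
  # a memory budget in GB. Sizes are hard-coded from this model card.
  SIZES_GB = {
      "Q8_0": 8.9, "Q6_K": 8.6, "Q5_K_M": 6.6, "Q4_K_M": 6.1,
      "Q4_K_S": 5.8, "Q4_1": 5.5, "Q4_0": 5.0, "IQ4_XS": 5.0,
      "IQ3_M": 4.9, "Q2_K": 4.7,
  }

  def pick_quant(budget_gb: float, headroom_gb: float = 1.0) -> str | None:
      # Reserve headroom for the KV cache and runtime buffers.
      fitting = {q: s for q, s in SIZES_GB.items() if s + headroom_gb <= budget_gb}
      return max(fitting, key=fitting.get) if fitting else None

  print(pick_quant(8.0))   # -> Q5_K_M (6.6 GB + 1 GB headroom fits in 8 GB)
  print(pick_quant(12.0))  # -> Q8_0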

License

See the base model card (nvidia/NVIDIA-Nemotron-Nano-9B-v2) for the governing license terms.
