NVIDIA-Nemotron-Nano-9B-v2-gguf

GGUF quantizations of NVIDIA’s NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.

Available Models

Model                                          Size     Bits/Weight   Description
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf      8.9 GB   ~8.0          Near-lossless, reference quality
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf      8.6 GB   ~6.0          High quality, recommended for most users
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf    6.6 GB   ~5.0          Good quality, balanced
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf    6.1 GB   ~4.0          Standard choice, good compression
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf    5.8 GB   ~4.0          4-bit K (small), smaller than Q4_K_M
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf      5.5 GB   ~4.0          Legacy 4-bit (Q4_1), better than Q4_0
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf      5.0 GB   ~4.0          Legacy 4-bit (Q4_0), smaller
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf    5.0 GB   4.25          i-quant, excellent compression
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf     4.9 GB   3.66          Ultra-small i-quant, mobile/edge deployment
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf      4.7 GB   ~2.0          2-bit K, maximum compression
NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf       17 GB    16.0          Full-precision reference (optional)
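
A GGUF file's size tracks parameter count times bits per weight, so you can sanity-check the table or estimate an untested quantization before downloading. A minimal sketch (hypothetical helper, using this model's 8.89B parameter count); note that K- and I-quants keep some tensors at higher precision, so real files run larger than the naive estimate:

  # Naive GGUF size estimate: parameters * bits-per-weight / 8 bytes.
  # Ignores metadata and the higher-precision tensors that mixed quants
  # keep (e.g., embeddings/output), so treat it as a lower bound.
  def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
      return n_params * bits_per_weight / 8 / 1e9

  # Sanity check against the table above (8.89B parameters):
  for name, bpw in [("f16", 16.0), ("Q8_0", 8.0), ("IQ4_XS", 4.25)]:
      print(f"{name}: ~{estimate_gguf_size_gb(8.89e9, bpw):.1f} GB")
  # f16 -> ~17.8 GB, Q8_0 -> ~8.9 GB, IQ4_XS -> ~4.7 GB (actual file: 5.0 GB)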

Usage

  • Download a quantization:
    huggingface-cli download weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf --local-dir ./
  • Run it with llama.cpp (here with a 4096-token context); a sketch for querying the running server follows below:
    ./llama-server -m NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf -c 4096
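
Once llama-server is up, it exposes an OpenAI-compatible chat endpoint. A minimal stdlib-only client, assuming the default host and port (http://localhost:8080; adjust if you pass --host/--port):

  # Query llama-server's OpenAI-compatible /v1/chat/completions endpoint.
  # Assumes the server started above is listening on http://localhost:8080.
  import json
  import urllib.request

  payload = {
      "messages": [{"role": "user", "content": "Summarize GGUF in one sentence."}],
      "temperature": 0.7,
      "max_tokens": 64,
  }
  req = urllib.request.Request(
      "http://localhost:8080/v1/chat/completions",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      body = json.load(resp)
  print(body["choices"][0]["message"]["content"])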

Performance (tokens/s)

Decode throughput for CPU, CUDA, and CUDA + Flash Attention backends on a 24 GB RTX 3090, across factoid, code, and reasoning prompts (n_predict=64, temp=0.7, top_p=0.95).

          CPU                       CUDA                      CUDA+FA
Model     Factoid  Code   Reas.     Factoid  Code   Reas.     Factoid  Code   Reas.
IQ3_M     10.96    9.83   9.84      59.51    48.83  51.22     49.46    51.48  51.54
Q4_K_M    8.59     8.03   8.02      48.28    48.72  48.70     53.48    48.73  47.97
Q5_K_M    7.54     7.54   7.52      49.09    46.00  46.87     51.25    50.58  47.00
Q6_K      6.65     6.19   5.89      52.77    41.84  42.06     47.59    41.48  42.85
Q8_0      6.95     5.79   5.93      45.99    40.81  41.51     48.32    41.21  41.54

Notes:

  • IQ3_M is the fastest quantization on this setup; Q4_K_M is nearly as fast with stronger quality.
  • Flash Attention helps inconsistently at these sizes; larger micro-batches (e.g., --ubatch-size 1024) can improve throughput. A sketch for measuring throughput on your own hardware follows below.
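
These numbers are specific to one machine, so it is worth re-measuring locally. One rough approach (a hypothetical harness, not what produced the table above) times a completion through the running llama-server and divides generated tokens by wall-clock time; the elapsed time includes prompt processing, so it slightly underestimates pure decode speed:

  # Rough tokens/s probe against a running llama-server (default port 8080).
  # Hypothetical harness; not the script used for the table above.
  import json
  import time
  import urllib.request

  def measure_tps(prompt: str, max_tokens: int = 64) -> float:
      payload = {
          "messages": [{"role": "user", "content": prompt}],
          "max_tokens": max_tokens,
          "temperature": 0.7,
          "top_p": 0.95,
      }
      req = urllib.request.Request(
          "http://localhost:8080/v1/chat/completions",
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      start = time.perf_counter()
      with urllib.request.urlopen(req) as resp:
          body = json.load(resp)
      elapsed = time.perf_counter() - start
      # "usage.completion_tokens" follows the OpenAI response schema.
      return body["usage"]["completion_tokens"] / elapsed

  print(f"{measure_tps('What is the capital of France?'):.2f} tok/s")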

Notes

  • Base model: nvidia/NVIDIA-Nemotron-Nano-9B-v2 (nemotron_h architecture, 8.89B parameters)
  • These are GGUF files suitable for llama.cpp and compatible backends.
  • Choose a quantization based on your memory budget and quality needs (see the table above); a toy selection helper follows below.
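
If you want to script that choice, a toy helper (hypothetical, with sizes hard-coded from the table above) can pick the largest file that fits a given memory budget:

  # Toy quantization picker: largest file from the table above that fits
  # a memory budget in GB. Sizes are hard-coded from this model card.
  SIZES_GB = {
      "Q8_0": 8.9, "Q6_K": 8.6, "Q5_K_M": 6.6, "Q4_K_M": 6.1,
      "Q4_K_S": 5.8, "Q4_1": 5.5, "Q4_0": 5.0, "IQ4_XS": 5.0,
      "IQ3_M": 4.9, "Q2_K": 4.7,
  }

  def pick_quant(budget_gb: float, headroom_gb: float = 1.0) -> str | None:
      # Reserve headroom for the KV cache and runtime buffers.
      fitting = {q: s for q, s in SIZES_GB.items() if s + headroom_gb <= budget_gb}
      return max(fitting, key=fitting.get) if fitting else None

  print(pick_quant(8.0))   # -> Q5_K_M (6.6 GB + 1 GB headroom fits in 8 GB)
  print(pick_quant(12.0))  # -> Q8_0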

License

See the base model card (nvidia/NVIDIA-Nemotron-Nano-9B-v2) for the governing license terms.
