---
pipeline_tag: text-generation
base_model:
- deepseek-ai/DeepSeek-R1
license: mit
library_name: Model Optimizer
tags:
- nvidia
- ModelOpt
- DeepSeekR1
- quantized
- FP4
---
# Model Overview
## Description:
The NVIDIA DeepSeek R1 FP4 v2 model is the quantized version of DeepSeek AI's DeepSeek R1 model, an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/deepseek-ai/DeepSeek-R1). The NVIDIA DeepSeek R1 FP4 v2 model is quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
Compared to [nvidia/DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), this checkpoint additionally quantizes the `wo` (attention output projection) modules in the attention layers.
This model is ready for commercial/non-commercial use.
## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA [(DeepSeek R1) Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1).
### License/Terms of Use:
[MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)
## Model Architecture:
**Architecture Type:** Transformer
**Network Architecture:** DeepSeek R1
## Input:
**Input Type(s):** Text
**Input Format(s):** String
**Input Parameters:** 1D (One Dimensional): Sequences
**Other Properties Related to Input:** DeepSeek recommends adhering to the following configurations when using the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance (an example request following these recommendations is sketched after this list): \
- Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
- Avoid adding a system prompt; all instructions should be contained within the user prompt.
- For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
- When evaluating model performance, it is recommended to conduct multiple tests and average the results.
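The snippet below is a minimal sketch of how these recommendations could be applied to a request against an OpenAI-compatible endpoint, such as the `trtllm-serve` deployment shown in the Usage section below. The `openai` Python package, the local endpoint URL, and the placeholder API key are illustrative assumptions, not requirements of this model.
```python
# Illustrative only: assumes an OpenAI-compatible server (e.g. the trtllm-serve
# deployment from the Usage section) is already listening on localhost:8000.
# The `openai` client package and the dummy API key are assumptions for this sketch.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="nvidia/DeepSeek-R1-FP4-v2",
    # No system message: all instructions are placed in the user prompt.
    messages=[{
        "role": "user",
        "content": (
            "Solve x^2 - 5x + 6 = 0. "
            "Please reason step by step, and put your final answer within \\boxed{}."
        ),
    }],
    temperature=0.6,   # recommended range is 0.5-0.7
    max_tokens=1024,
)
print(response.choices[0].message.content)
```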
## Output:
**Output Type(s):** Text
**Output Format:** String
**Output Parameters:** 1D (One Dimensional): Sequences
## Software Integration:
**Supported Runtime Engine(s):**
* TensorRT-LLM
**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Blackwell
**Preferred Operating System(s):**
* Linux
## Model Version(s):
The model is quantized with nvidia-modelopt **v0.33.0**.
## Training Dataset:
* **Data Collection Method by dataset:** Hybrid: Human, Automated
* **Labeling Method by dataset:** Hybrid: Human, Automated
## Testing Dataset:
* **Data Collection Method by dataset:** Hybrid: Human, Automated
* **Labeling Method by dataset:** Hybrid: Human, Automated
## Evaluation Dataset:
* **Data Collection Method by dataset:** Hybrid: Human, Automated
* **Labeling Method by dataset:** Hybrid: Human, Automated
## Calibration Datasets:
* **Calibration Dataset:** [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail)
* **Data Collection Method by dataset:** Automated
* **Labeling Method by dataset:** Automated
## Inference:
**Engine:** TensorRT-LLM
**Test Hardware:** B200
## Post Training Quantization
This model was obtained by quantizing the weights and activations of DeepSeek R1 to FP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformer blocks are quantized. This optimization reduces the number of bits per parameter from 8 to 4, reducing the disk size and GPU memory requirements by approximately 1.6x.
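For reference, the following is a minimal sketch of a generic ModelOpt FP4 post-training quantization flow with calibration on cnn_dailymail. It is illustrative only and not the exact recipe used to produce this checkpoint; the calibration sample count, sequence length, and export directory are assumptions.
```python
# Minimal sketch of FP4 post-training quantization with TensorRT Model Optimizer.
# Illustrative only: the published checkpoint was produced with NVIDIA's DeepSeek R1
# quantization recipe; sample count, sequence length, and export path are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "deepseek-ai/DeepSeek-R1"  # requires multiple GPUs to load in practice
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Calibration data: a few hundred samples from cnn_dailymail is typical.
calib_texts = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:512]")["article"]

def forward_loop(model):
    # Run calibration batches through the model so activation ranges can be collected.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
        with torch.no_grad():
            model(**inputs)

# Quantize the linear layers in the transformer blocks to FP4 (weights and activations).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style quantized checkpoint for TensorRT-LLM deployment.
export_hf_checkpoint(model, export_dir="DeepSeek-R1-FP4-v2")
```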
## Usage
### Deploy with TensorRT-LLM
To deploy the quantized FP4 checkpoint with the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample code below (you will need 8x B200 GPUs and TensorRT-LLM built from source from the latest main branch):
#### LLM API sample usage:
```python
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(max_tokens=32)

    llm = LLM(model="nvidia/DeepSeek-R1-FP4-v2", tensor_parallel_size=8, enable_attention_dp=True)
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()
```
#### Minimum Latency Server Deployment
If you want to deploy an endpoint that minimizes response latency for a single-concurrency or low-concurrency use case, follow the steps below.
**Step 1: Create configuration file (`args.yaml`)**
```yaml
moe_backend: TRTLLM
use_cuda_graph: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  use_relaxed_acceptance_for_thinking: true
  relaxed_topk: 10
  relaxed_delta: 0.6
```
**Step 2: Start the TensorRT-LLM server**
```bash
trtllm-serve nvidia/DeepSeek-R1-FP4-v2 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 4 \
  --tp_size 8 \
  --ep_size 2 \
  --max_num_tokens 32768 \
  --trust_remote_code \
  --extra_llm_api_options args.yaml \
  --kv_cache_free_gpu_memory_fraction 0.75
```
**Step 3: Send an example query**
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/DeepSeek-R1-FP4-v2",
    "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
    "max_tokens": 1024
  }'
```
### Evaluation
The accuracy benchmark results are presented in the table below:
| Precision | MMLU Pro | GPQA Diamond | HLE | LiveCodeBench | MATH-500 | AIME 2024 |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| FP8 (AA Ref) | 84 | 71 | 9 | 62 | 96 | 68 |
| FP4 | 83 | 71 | 9 | 68 | 96 | 74 |