nm-research committed
Commit 0354d13 · verified · Parent: 285c7b3

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -12,14 +12,14 @@ base_model: Qwen/Qwen2.5-VL-72B-Instruct
 library_name: transformers
 ---
 
-# Qwen2.5-VL-72B-Instruct-quantized-w8a8
+# Qwen2.5-VL-72B-Instruct-quantized-w4a16
 
 ## Model Overview
 - **Model Architecture:** Qwen/Qwen2.5-VL-72B-Instruct
 - **Input:** Vision-Text
 - **Output:** Text
 - **Model Optimizations:**
-  - **Weight quantization:** INT8
+  - **Weight quantization:** INT4
   - **Activation quantization:** FP16
 - **Release Date:** 2/24/2025
 - **Version:** 1.0
@@ -29,7 +29,7 @@ Quantized version of [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights of [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) to INT8 data type, ready for inference with vLLM >= 0.5.2.
+This model was obtained by quantizing the weights of [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) to INT4 data type, ready for inference with vLLM >= 0.5.2.
 
 ## Deployment
 
@@ -203,10 +203,10 @@ The model was evaluated using [mistral-evals](https://github.com/neuralmagic/mistral-evals)
 - chartqa
 
 ```
-vllm serve neuralmagic/pixtral-12b-quantized.w8a8 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7
+vllm serve RedHatAI/Qwen2.5-VL-72B-Instruct-quantized.w4a16 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7
 
 python -m eval.run eval_vllm \
-  --model_name neuralmagic/pixtral-12b-quantized.w8a8 \
+  --model_name RedHatAI/Qwen2.5-VL-72B-Instruct-quantized.w4a16 \
   --url http://0.0.0.0:8000 \
   --output_dir ~/tmp \
   --eval_name <vision_task_name>
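
The `vllm serve` command in the updated hunk exposes an OpenAI-compatible endpoint, so a quick way to sanity-check the deployment is to query it with the `openai` client. A minimal sketch, assuming the server started by the command above is reachable at http://0.0.0.0:8000 and using a placeholder image URL that is not from the card:

```python
# Minimal sketch: query the OpenAI-compatible endpoint exposed by `vllm serve`.
# The image URL below is a placeholder; replace it with a real image.
from openai import OpenAI

# vLLM does not check the API key, but the client requires one to be set.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen2.5-VL-72B-Instruct-quantized.w4a16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize this chart."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```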
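
For a rough sense of what the w8a8 to w4a16 rename means in practice: halving weight precision roughly halves the weight footprint. A back-of-envelope sketch, assuming the nominal 72B parameter count and ignoring quantization scales and any layers kept in higher precision:

```python
# Back-of-envelope weight footprint for a nominal 72B-parameter model.
# Real checkpoints add scales/zero-points and may keep some modules
# (e.g. lm_head, vision tower) in higher precision, so treat these as lower bounds.
params = 72e9
for name, bits in [("FP16", 16), ("INT8 (w8a8)", 8), ("INT4 (w4a16)", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name:>12}: ~{gib:.0f} GiB of weights")
```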