---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- mlx
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# NexaAI/Qwen2.5-VL-7B-Instruct-4bit-MLX

## Quickstart

Run this model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed.

In the nexa-sdk CLI:

```bash
NexaAI/Qwen2.5-VL-7B-Instruct-4bit-MLX
```

## Overview

In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and provided us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

### Key Enhancements:

- **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

- **Being agentic**: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.

- **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it adds a new ability to capture events by pinpointing the relevant video segments.

- **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.

- **Generating structured outputs**: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting use cases in finance and commerce.

## Benchmark Results

### Image benchmark

| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMMU (val) | 56 | 50.4 | **60** | 54.1 | 58.6 |
| MMMU-Pro (val) | 34.3 | - | 37.6 | 30.5 | 41.0 |
| DocVQA (test) | 93 | 93 | - | 94.5 | **95.7** |
| InfoVQA (test) | 77.6 | - | - | 76.5 | **82.6** |
| ChartQA (test) | 84.8 | - | - | 83.0 | **87.3** |
| TextVQA (val) | 79.1 | 80.1 | - | 84.3 | **84.9** |
| OCRBench | 822 | 852 | 785 | 845 | **864** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** |
| MMStar | 62.8 | - | - | 60.7 | **63.9** |
| MMBench-V1.1 (test, EN) | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** |
| MMT-Bench (test) | - | - | - | **63.7** | 63.6 |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 | 63.9 |
| MMVet (GPT-4-Turbo) | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** |
| HallBench (avg) | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** |
| MathVista (testmini) | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** |
| MathVision | - | - | - | 16.3 | **25.07** |

### Video Benchmarks

| Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: |
| MVBench | 67.0 | **69.6** |
| PerceptionTest (test) | 66.9 | **70.5** |
| Video-MME (w/o / w/ subs) | 63.3/69.0 | **65.1**/**71.6** |
| LVBench | - | 45.3 |
| LongVideoBench | - | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | - | 71.7 |
| MLVU | - | 70.2 |
| CharadesSTA (mIoU) | - | 43.6 |

### Agent benchmark

| Benchmarks | Qwen2.5-VL-7B |
|-------------------------|---------------|
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |

## Reference

**Original model card**: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
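
## Usage with mlx-vlm (optional)

Because this is a 4-bit MLX checkpoint, it can also be loaded outside nexa-sdk on Apple Silicon. Below is a minimal sketch using the community [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) package. The `load`, `generate`, `apply_chat_template`, and `load_config` helpers, as well as the example image path, are assumptions based on mlx-vlm's typical workflow and are not part of this model card; the exact API can vary between mlx-vlm releases, so check its documentation against your installed version.

```python
# Hypothetical sketch: running this MLX-quantized checkpoint with mlx-vlm.
# NOTE: the mlx-vlm API shown here is an assumption and may differ across versions.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "NexaAI/Qwen2.5-VL-7B-Instruct-4bit-MLX"

# Download (if needed) and load the quantized weights plus the processor.
model, processor = load(model_path)
config = load_config(model_path)

# One local or remote image and a text prompt.
images = ["example.jpg"]  # hypothetical path: replace with your own image
prompt = "Describe this image."

# Wrap the prompt in the model's chat template, declaring how many images follow.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))

# Generate a response on-device via MLX.
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```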