---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- mlx
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# NexaAI/Qwen2.5-VL-7B-Instruct-4bit-MLX

## Quickstart

Run this model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed.

In the nexa-sdk CLI:

```bash
NexaAI/Qwen2.5-VL-7B-Instruct-4bit-MLX
```

## Overview

In the five months since Qwen2-VL's release, numerous developers have built new models on top of the Qwen2-VL vision-language models and provided us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

### Key Enhancements:

- **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

- **Being agentic**: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.

- **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it adds a new ability to capture events by pinpointing the relevant video segments.

- **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.

- **Generating structured outputs**: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting use cases in finance and commerce.

## Benchmark Results

### Image benchmark

| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMMU (val) | 56 | 50.4 | **60** | 54.1 | 58.6 |
| MMMU-Pro (val) | 34.3 | - | 37.6 | 30.5 | 41.0 |
| DocVQA (test) | 93 | 93 | - | 94.5 | **95.7** |
| InfoVQA (test) | 77.6 | - | - | 76.5 | **82.6** |
| ChartQA (test) | 84.8 | - | - | 83.0 | **87.3** |
| TextVQA (val) | 79.1 | 80.1 | - | 84.3 | **84.9** |
| OCRBench | 822 | 852 | 785 | 845 | **864** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** |
| MMStar | 62.8 | - | - | 60.7 | **63.9** |
| MMBench-V1.1 (test, EN) | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** |
| MMT-Bench (test) | - | - | - | **63.7** | 63.6 |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 | 63.9 |
| MMVet (GPT-4-Turbo) | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** |
| HallBench (avg) | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** |
| MathVista (testmini) | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** |
| MathVision | - | - | - | 16.3 | **25.07** |

### Video Benchmarks

| Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: |
| MVBench | 67.0 | **69.6** |
| PerceptionTest (test) | 66.9 | **70.5** |
| Video-MME (w/o / w/ subs) | 63.3/69.0 | **65.1**/**71.6** |
| LVBench | - | 45.3 |
| LongVideoBench | - | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | - | 71.7 |
| MLVU | - | 70.2 |
| CharadesSTA (mIoU) | - | 43.6 |

### Agent benchmark

| Benchmarks | Qwen2.5-VL-7B |
|-------------------------|---------------|
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |

## Reference

**Original model card**: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
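
## Usage with mlx-vlm (optional)

Because this is a 4-bit MLX checkpoint, it can also be loaded outside nexa-sdk on Apple Silicon. Below is a minimal sketch using the community [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) package. The `load`, `generate`, `apply_chat_template`, and `load_config` helpers, as well as the example image path, are assumptions based on mlx-vlm's typical workflow and are not part of this model card; the exact API can vary between mlx-vlm releases, so check its documentation against your installed version.

```python
# Hypothetical sketch: running this MLX-quantized checkpoint with mlx-vlm.
# NOTE: the mlx-vlm API shown here is an assumption and may differ across versions.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "NexaAI/Qwen2.5-VL-7B-Instruct-4bit-MLX"

# Download (if needed) and load the quantized weights plus the processor.
model, processor = load(model_path)
config = load_config(model_path)

# One local or remote image and a text prompt.
images = ["example.jpg"]  # hypothetical path: replace with your own image
prompt = "Describe this image."

# Wrap the prompt in the model's chat template, declaring how many images follow.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))

# Generate a response on-device via MLX.
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```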