Working quants for Qwen2.5 VL 7B.
We'll be uploading benchmark results along with the quants here.
The models have been tested on the latest llama.cpp, built with CLIP hardware acceleration manually enabled!
Consult the following post for more details: https://github.com/ggml-org/llama.cpp/issues/11483#issuecomment-2676422772
For now, you can only run single prompts via the CLI:
llama-qwen2vl-cli -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image." --image ~/Pictures/test_small.png
We're working on a wrapper API solution until multimodal support is added back to llama.cpp.
The API will be published here: https://github.com/Independent-AI-Labs/local-super-agents
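Until that API lands, one possible stopgap is to shell out to the CLI from a short Python wrapper. The sketch below is illustrative only: it reuses the paths and flags from the command above, and the helper name and defaults are our own, not part of the upcoming API.

```python
import subprocess
from pathlib import Path

# Illustrative stopgap: call llama-qwen2vl-cli from Python until server-side
# multimodal support returns to llama.cpp. Paths and flags mirror the command above.
def describe_image(image: str,
                   prompt: str = "Describe the image.",
                   model: str = "~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf",
                   mmproj: str = "~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf") -> str:
    cmd = [
        "llama-qwen2vl-cli",
        "-m", str(Path(model).expanduser()),
        "--mmproj", str(Path(mmproj).expanduser()),
        "--n_gpu_layers", "9999",
        "-p", prompt,
        "--image", str(Path(image).expanduser()),
    ]
    # check=True raises CalledProcessError if the CLI exits non-zero
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(describe_image("~/Pictures/test_small.png"))
```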
Let us know if you need a specific quant!
Benchmarking Update:
The latest main branch looks stable with Vulkan CLIP and every model we've thrown at it so far. Some preliminary insights:
- 1200x1200 is the largest image you can encode with 16 GB of VRAM (a resize sketch follows at the end of this update). clip.cpp does not seem to support multi-GPU Vulkan yet.
- A 4060 Ti-class GPU delivers 20-30 t/s with the Q8_0 quant and roughly double that with Q4 at 16-32K context.
- Batching (multiple images) in a single CLI call seems to be working fine:
llama-qwen2vl-cli --ctx-size 16000 -n 16000 -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image in detail. Extract all textual information from it. Output as detailed JSON." -p "Analyze the image." --image ~/Pictures/test_small.png --image ~/Pictures/test_small.png
Output quality looks very promising! We'll release all of the benchmark code when ready, so the process can be streamlined for other models.
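As a side note on the 1200x1200 limit above, a quick way to keep encodes within a 16 GB VRAM budget is to downscale images first. This is just a sketch using Pillow; the file names are placeholders and 1200 px reflects what we observed, not a hard constant.

```python
from pathlib import Path
from PIL import Image

# Downscale so neither side exceeds 1200 px, which kept CLIP encodes
# within 16 GB of VRAM in our tests. Purely illustrative pre-processing.
def fit_for_encode(src: str, dst: str, max_side: int = 1200) -> None:
    img = Image.open(Path(src).expanduser())
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; only shrinks
    img.save(Path(dst).expanduser())

fit_for_encode("~/Pictures/test_large.png", "~/Pictures/test_large_1200.png")
```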