MiniCPM-V-4_5-Q5_K_M.gguf not running on llama.cpp. gives "GGML_ASSERT(false && "unsupported minicpmv version") failed"

#2
by mku1988 - opened

I am running your Q5 GGUF with llama-server (build b6271) using the following parameters:
-m "..\models\MiniCPM\MiniCPM-V-4_5-Q5_K_M.gguf" --mmproj "..\models\MiniCPM\mmproj-MiniCPM-V-4_5-f16.gguf"
--image "..\models\hydration_page.png" --threads 8
--ctx-size 10000 --flash-attn
--n-gpu-layers 99 --cache-type-k q8_0
--cache-type-v q8_0 --temp 0.4
--top-p 0.95 --min-p 0.05
--top-k 0 --repeat-penalty 1.1
--seed 3407
I am getting the error:
D:\LLMs\llama.cpp\tools\mtmd\mtmd.cpp:223: GGML_ASSERT(false && "unsupported minicpmv version") failed.

OpenBMB org

We have merged support into llama.cpp, but the merge landed only recently, so you may need to pull and rebuild with the latest code.
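
For anyone hitting the same assert, a minimal sketch of updating a from-source llama.cpp build (assuming a CMake build; a prebuilt release newer than b6271 should also work):

```sh
# Pull the latest llama.cpp sources and rebuild (paths and backend flags are examples)
cd llama.cpp
git pull origin master
cmake -B build -DGGML_CUDA=ON    # adjust/remove backend flags for your hardware
cmake --build build --config Release -j
```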


Thanks, it works after updating llama.cpp.
There's another issue: its thinking is way too long. Is there a way to disable thinking in llama.cpp? I tried adding "/no_think" to the prompt, but it didn't work.

OpenBMB org

@mku1988
I've received your question and figured out the source of the issue. Most likely, llama.cpp doesn't yet have a control for enabling or disabling CoT (chain-of-thought) in multimodal mode.
I'll submit a pull request to llama.cpp this week to fix this and contribute it to the community.

Adding "--reasoning-budget 0" to my llama-server launch parameters worked for me to disable thinking.

I'm experiencing the same issue. Even after starting llama-server with --reasoning-budget 0, it still performs reasoning when asked about complex images. I hope this can be fixed soon. Thanks!

When I try to launch the model openbmb/MiniCPM-V-4_5-gguf in LM Studio, I get the following error:
Failed to load the model. Error loading model. (Exit code: 6). Please check settings and try loading the model again.

I have a MacBook Pro with an M4 Pro chip.

(Attached screenshot: Снимок экрана 2025-08-30 в 10.41.37.png)
