AWQ version
Please release an AWQ version. Thanks!
The AWQ quant tools do not support vision models yet AFAIK.
I tried the latest llm-compressor (as AutoAWQ has been adopted by the vLLM project), but their newest GPTQ example as an alternative failed for me due to OOM (even with 256 GB RAM, not VRAM).
Support by the Mistral AI team on llm-compressor would be nice.
Well, the experimental script for creating an FP8 quant did work.
For those who are interested, give stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8 a try.
vLLM came up with some errors and warnings, but it seems to work (using v0.9.1 on an L40, reducing the max model len and max image count, using the fp8 KV cache...).
INFO 06-25 18:15:13 [worker.py:294] Memory profiling takes 6.48 seconds
INFO 06-25 18:15:13 [worker.py:294] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.98) = 43.50GiB
INFO 06-25 18:15:13 [worker.py:294] model weights take 24.05GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 3.65GiB; the rest of the memory reserved for KV Cache is 15.52GiB.
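In case it helps, this is roughly what an FP8 dynamic quant with llm-compressor looks like (a minimal sketch, not my exact script; the ignore patterns for the vision parts and the oneshot import path differ between llm-compressor versions, so treat those as assumptions):

from transformers import AutoProcessor, Mistral3ForConditionalGeneration
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"
SAVE_DIR = "Mistral-Small-3.2-24B-Instruct-2506-FP8"

model = Mistral3ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)  # or AutoTokenizer, depending on the repo layout

# FP8 dynamic quantization needs no calibration data; keep the LM head and
# the vision parts in higher precision (these ignore patterns are an assumption).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)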
Would you mind sharing your vLLM params?
vllm serve Mistral-Small-3.2-24B-Instruct-2506 --tokenizer-mode mistral --config-format mistral --load-format mistral --tool-call-parser mistral --enable-auto-tool-choice --port 8101 --gpu-memory-utilization 0.98 --max-model-len 16384 --limit_mm_per_prompt 'image=2' --kv-cache-dtype fp8
It did work for me.
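For a quick vision test against the OpenAI-compatible endpoint, something like this works (the image URL and the served model name are just placeholders, adjust them to whatever vLLM reports at startup):

from openai import OpenAI

# Assumes the vllm serve command above is running on localhost:8101.
client = OpenAI(base_url="http://localhost:8101/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mistral-Small-3.2-24B-Instruct-2506",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)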
How do you suppose the 'OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym' model works?
I've been using this quantized version for some time and it's been working great with vLLM.
Well, their model card states that they used Intel's auto-round toolkit: https://github.com/intel/auto-round
I wasn't aware that they also support CUDA as a platform. I had the impression it was Intel CPU/NPU only.
I will give it a try at the weekend. Is the OPEA quant text-only, or does it also support image-text-to-text?
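From a quick look at their README, the basic flow seems to be roughly this (untested on my side; bits and sym are picked to match the "int4 ... awq-sym" in the OPEA repo name, and for the vision parts they presumably used auto-round's multimodal path, which differs from this text-only sketch):

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
SAVE_DIR = "Mistral-Small-3.1-24B-Instruct-2503-int4-awq-sym"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W4, symmetric, group size 128; export in AWQ-compatible format.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized(SAVE_DIR, format="auto_awq")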
It recognizes images as well. It would be great if you could do this!
Well, it seems that at least the current version of auto-round is not yet ready for this Mistral version.
KeyError: <class 'transformers.models.mistral3.configuration_mistral3.Mistral3Config'>
I will have to take a deeper look into it and/or ask the OPEA team what they did for v3.1.
unsloth/Mistral-Small-3.2-24B-Instruct-2506 seems to load.
I've only seen bnb and GGUF quants from unsloth.
Is there any update on this? Is it possible to get vision working with AWQ? Does AutoAWQ not support vision multimodal models?
And is there any new information on that OPEA setup? To have a working AWQ multimodal would be really great!
So far, I only know of Gemma 3-27B, as it is converted directly from QAT. So it seems that would work in the AWQ format. There is no other model currently in AWQ with vision? Thanks!
The OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym most certainly works with vision as I've used it.
jeffcookio did a successful AWQ quant with a newer version of llm-compressor: jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym
Tool calling does not work yet, as he mentions on his model card - I haven't tried it myself.
Just saw that auto-round had an update, which I want to test, too.
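If someone wants to try the jeffcookio quant in the meantime, something along the lines of my command above should be a starting point (untested with this quant on my side; vLLM should pick up the quantization scheme from the model config, and I left out the tool-call flags since tool calling is reported as not working):

vllm serve jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym --port 8101 --max-model-len 16384 --limit_mm_per_prompt 'image=2' --kv-cache-dtype fp8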