AWQ version
Please release an AWQ version. Thanks!
The AWQ quant tools do not support vision models yet AFAIK.
I tried the latest llm-compressor (as AutoAWQ has been adopted by the vLLM project), but their newest GPTQ example as an alternative failed for me due to OOM (even with 256 GB RAM, not VRAM).
Support by the Mistral AI team on llm-compressor would be nice.
Well, the experimental script for creating an FP8 quant did work.
For those who are interested, give stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8 a try.
vLLM came up with some errors and warnings, but it seems to work (using v0.9.1 on an L40, reducing the max model len and max image count, using the fp8 KV cache...).
INFO 06-25 18:15:13 [worker.py:294] Memory profiling takes 6.48 seconds
INFO 06-25 18:15:13 [worker.py:294] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.98) = 43.50GiB
INFO 06-25 18:15:13 [worker.py:294] model weights take 24.05GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 3.65GiB; the rest of the memory reserved for KV Cache is 15.52GiB.
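In case it helps, this is roughly what an FP8 dynamic quant with llm-compressor looks like (a minimal sketch, not my exact script; the ignore patterns for the vision parts and the oneshot import path differ between llm-compressor versions, so treat those as assumptions):

from transformers import AutoProcessor, Mistral3ForConditionalGeneration
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"
SAVE_DIR = "Mistral-Small-3.2-24B-Instruct-2506-FP8"

model = Mistral3ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)  # or AutoTokenizer, depending on the repo layout

# FP8 dynamic quantization needs no calibration data; keep the LM head and
# the vision parts in higher precision (these ignore patterns are an assumption).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)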
Would you mind sharing your vLLM params?
vllm serve Mistral-Small-3.2-24B-Instruct-2506 --tokenizer-mode mistral --config-format mistral --load-format mistral --tool-call-parser mistral --enable-auto-tool-choice --port 8101 --gpu-memory-utilization 0.98 --max-model-len 16384 --limit_mm_per_prompt 'image=2' --kv-cache-dtype fp8
It did work for me.
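For a quick vision test against the OpenAI-compatible endpoint, something like this works (the image URL and the served model name are just placeholders, adjust them to whatever vLLM reports at startup):

from openai import OpenAI

# Assumes the vllm serve command above is running on localhost:8101.
client = OpenAI(base_url="http://localhost:8101/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mistral-Small-3.2-24B-Instruct-2506",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)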
How do you suppose the 'OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym' model works?
I've been using this quantized version for some time and it's been working great with vLLM.
Well, their model card states that they used Intel's auto-round toolkit: https://github.com/intel/auto-round
I wasn't aware that they also support CUDA as a platform. I had the impression it was Intel CPU/NPU only.
I will give it a try at the weekend. Is the OPEA quant text-only, or does it also support image-text-to-text?
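From a quick look at their README, the basic flow seems to be roughly this (untested on my side; bits and sym are picked to match the "int4 ... awq-sym" in the OPEA repo name, and for the vision parts they presumably used auto-round's multimodal path, which differs from this text-only sketch):

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
SAVE_DIR = "Mistral-Small-3.1-24B-Instruct-2503-int4-awq-sym"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W4, symmetric, group size 128; export in AWQ-compatible format.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized(SAVE_DIR, format="auto_awq")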
It recognizes images as well. It would be great if you could do this!
Well, it seems that at least the current version of auto-round is not yet ready for this Mistral version.
KeyError: <class 'transformers.models.mistral3.configuration_mistral3.Mistral3Config'>
I will have to take a deeper look into it and/or ask the OPEA team what they did for v3.1.
unsloth/Mistral-Small-3.2-24B-Instruct-2506 seems to load.
I've only seen bnb and GGUF quants from unsloth.
Is there any update on this? Is it possible to get vision working with AWQ? Does AutoAWQ not support vision multimodal models?
And is there any new information on that OPEA setup? To have a working AWQ multimodal would be really great!
So far, I only know of Gemma 3-27B, as it is converted directly from QAT. So it seems that would work in the AWQ format. There is no other model currently in AWQ with vision? Thanks!
The OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym most certainly works with vision as I've used it.
jeffcookio did a successful AWQ quant with a newer version of llm-compressor: jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym
Tool calling does not work yet, as he mentions on his model card - I haven't tried it myself.
Just saw that auto-round had an update, which I want to test, too.
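If someone wants to try the jeffcookio quant in the meantime, something along the lines of my command above should be a starting point (untested with this quant on my side; vLLM should pick up the quantization scheme from the model config, and I left out the tool-call flags since tool calling is reported as not working):

vllm serve jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym --port 8101 --max-model-len 16384 --limit_mm_per_prompt 'image=2' --kv-cache-dtype fp8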