This model performs worse than Mistral-Small-3.1-24B at 4-bit quantization.

#6
by zletpm - opened

I’m using a prompt to convert a Markdown file to plain text. However, at 4-bit quantization this model performs worse than Mistral-Small-3.1-24B.

  1. The model follows instructions less reliably. When prompted to return only the processed plain text, it instead prepends preambles like “here is processed chunks:” or explanations like “I have done what….”

  2. When the generation is repeated, this 4-bit model produces duplicated text far more often.

Did you set temperature=0.15 and ensure your System Message comes first?

Also, did you check that your chat UI or code is (a) updated to a version that supports 3.2, and (b) NOT sending the user, frequency_penalty, or presence_penalty params? Mistral will error on them when fronting 3.2 with an OpenAI-compatible API.
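
For illustration, a request shaped the way I mean might look like this (a minimal sketch; the base_url, api_key, and model id are placeholders, not your actual setup):

```python
# Minimal sketch of a well-formed request against an OpenAI-compatible
# server hosting 3.2 -- base_url, api_key, and model id below are
# placeholders for whatever your deployment exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    messages=[
        # System message first, temperature at 0.15.
        {"role": "system", "content": "Return only the processed plain text."},
        {"role": "user", "content": "# Heading\nSome *markdown* to convert."},
    ],
    temperature=0.15,
    # Deliberately NOT passing user, frequency_penalty, or presence_penalty:
    # Mistral errors on these when served behind an OpenAI-compatible API.
)
print(response.choices[0].message.content)
```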

If you're using Transformers, have you updated mistral-common to >= 1.6.2 as per the model card?
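
A quick way to confirm what you actually have installed (plain standard-library metadata lookup; upgrade with `pip install -U "mistral-common>=1.6.2"` if it's older):

```python
# Check the installed mistral-common against the model card's minimum.
from importlib.metadata import version

print(version("mistral-common"))  # should print 1.6.2 or later
```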

If not, you can get some strange results.

You would need to post a lot more info, such as your inference code or chat UI product name/config/logs, for anyone to really be able to help you. Check the points above first and make sure you can eliminate them as the cause of the issue.

Thank you for your reply. It appears I’ve identified the cause of the duplicated output: speculative decoding. I was using a draft model compatible with Mistral-Small 3.1, not 3.2. When I disable speculative decoding, I no longer see any duplicates. I’m using the MLX-LM backend with the Mistral model; the rest of the setup is unchanged.
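
For anyone hitting the same thing, plain generation through mlx_lm with no draft model configured (i.e. speculative decoding off) follows the usual mlx-lm pattern; the 4-bit repo id below is a placeholder for whichever conversion you're running:

```python
# Plain MLX-LM generation with speculative decoding off: just load and
# generate, with no draft model configured anywhere.
from mlx_lm import load, generate

# Placeholder repo id -- substitute the 4-bit MLX conversion you actually use.
model, tokenizer = load("mlx-community/Mistral-Small-3.2-24B-Instruct-2506-4bit")

messages = [
    {"role": "system", "content": "Return only the processed plain text."},
    {"role": "user", "content": "# Heading\nSome *markdown* to convert."},
]
# Apply the model's chat template so the system message comes first.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(text)
```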
