This model performs worse than Mistral-Small-3.1-24B at 4-bit quantization.

#6
by zletpm - opened

I’m using a prompt to convert a Markdown file to plain text. However, at 4-bit quantization this model performs worse than Mistral-Small-3.1-24B.

  1. The model follows instructions less reliably. When prompted to return only the processed plain text, it instead prepends preambles like “here is processed chunks:” or explanations like “I have done what….”

  2. When the generation is repeated, this 4-bit model produces duplicated text far more often.

Did you set temperature=0.15 and ensure your System Message comes first?

Also, did you check that your chat UI or code is (a) updated to a version that supports 3.2, and (b) NOT sending the user, frequency_penalty, or presence_penalty params? Mistral will error on them when fronting 3.2 with an OpenAI-compatible API.
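
For illustration, a request shaped the way I mean might look like this (a minimal sketch; the base_url, api_key, and model id are placeholders, not your actual setup):

```python
# Minimal sketch of a well-formed request against an OpenAI-compatible
# server hosting 3.2 -- base_url, api_key, and model id below are
# placeholders for whatever your deployment exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    messages=[
        # System message first, temperature at 0.15.
        {"role": "system", "content": "Return only the processed plain text."},
        {"role": "user", "content": "# Heading\nSome *markdown* to convert."},
    ],
    temperature=0.15,
    # Deliberately NOT passing user, frequency_penalty, or presence_penalty:
    # Mistral errors on these when served behind an OpenAI-compatible API.
)
print(response.choices[0].message.content)
```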

If you're using Transformers, have you updated mistral-common to >= 1.6.2 as per the model card?
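
A quick way to confirm what you actually have installed (plain standard-library metadata lookup; upgrade with `pip install -U "mistral-common>=1.6.2"` if it's older):

```python
# Check the installed mistral-common against the model card's minimum.
from importlib.metadata import version

print(version("mistral-common"))  # should print 1.6.2 or later
```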

If not, you can get some strange results.

You would need to post a lot more info, such as your inference code or chat UI product name/config/logs, for anyone to really be able to help you. Check the points above first and make sure you can eliminate them as the cause of the issue.

Thank you for your reply. It appears I’ve identified the cause of the duplicated output: speculative decoding. I was using a draft model compatible with Mistral-Small 3.1, not 3.2. When I disable speculative decoding, I no longer see any duplicates. I’m using the MLX-LM backend with the Mistral model; the rest of the setup is unchanged.
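
For anyone hitting the same thing, plain generation through mlx_lm with no draft model configured (i.e. speculative decoding off) follows the usual mlx-lm pattern; the 4-bit repo id below is a placeholder for whichever conversion you're running:

```python
# Plain MLX-LM generation with speculative decoding off: just load and
# generate, with no draft model configured anywhere.
from mlx_lm import load, generate

# Placeholder repo id -- substitute the 4-bit MLX conversion you actually use.
model, tokenizer = load("mlx-community/Mistral-Small-3.2-24B-Instruct-2506-4bit")

messages = [
    {"role": "system", "content": "Return only the processed plain text."},
    {"role": "user", "content": "# Heading\nSome *markdown* to convert."},
]
# Apply the model's chat template so the system message comes first.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(text)
```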
