How to get quality as good as this quant? (For translations)
@awni
Hi Awni
The quality of your 4-bit quant is MUCH better than what I or others are able to create. Could you perhaps please share how I can improve my attempts?
I want to use DeepSeek V3-0324 for translations on a Mac Studio 512GB. It seems that translations are very sensitive to quantization, so I'm trying to squeeze the maximum accuracy out of quants that will fit into the 512GB. I created a 5-bit quant by using your gist to create a BF16 version, and then quantizing that to 5-bit. The quality of the translations of my quant is the same as that of the 5-bit quant created by Ivan Fioravanti at mlx-community/DeepSeek-V3-0324-5bit. I also tried a 4-bit and a mixed_4_6 version.
(I used mlx_lm.convert --hf-path DeepSeek-V3-0324-BF16 -q --q-bits 5 --mlx-path DeepSeek-V3-0324-5bit, with the latest version of mlx-lm.)
But all of them give worse quality for the translations than your 4-bit quant. I've tried to see what is different, but did not find anything obvious.
Would really appreciate it if you could point me in the right direction please!
I didn't do anything special to make this quant. What machine are you making your quants on? And maybe you could share a sample prompt for testing?
Hi Awni
Wow. Thank you for answering even on a weekend.
It is a Mac Studio M3 Ultra with 512 GB, and 4 TB SSD.
A Mac Mini M4 Pro 64GB gave the same results.
A sample prompt:
mlx_lm.generate \
--model mlx-community_DeepSeek-V3-0324-4bit \
--temp 1.3 \
--min-p 0.01 \
--seed 3407 \
--max-tokens 5000 \
--prompt "<|User|>Translate from English to Dutch: These are Bible studies on different topics, written by various brothers. Please republish and translate these studies as God leads you. The studies are published to be a help to people. We certainly don't claim to know everything, or be correct in all places. If you see any mistakes, if you have additional facts that will help people to better understand a point, or if you want a specific topic discussed, please let us know.<|Assistant|>"
For example, in the piece where it says "Please republish and translate...", the word "republish" is translated as "hervat" in mine and in other quants. "Hervat" means "resume", not "republish". Your quant translates it as "Herpubliceer", which is the correct translation.
All my MLX quants (4-bit, mixed_4_6, and 5-bit) translate it as "hervat" (resume). So does the 5-bit version on Hugging Face, and so does unsloth_DeepSeek-V3-0324-GGUF-UD:Q5_K_M (GGUF).
The seed is there so that I get repeatable results for now while testing.
The temperature is what DeepSeek recommends at https://api-docs.deepseek.com/quick_start/parameter_settings/
My main question is probably how to get the best possible translations with DeepSeek V3-0324 on the Mac Studio 512GB.
PS: This is just one example. I obviously test it with more than this one paragraph. This quant of yours just seems to produce better-quality Dutch translations. 🤷‍♂️
@awni
Hi Awni
OK, I managed to track it down. You were correct when you said "I didn't do anything special to make this quant."
The only difference I could see was that you used mlx-lm 0.22.2, whereas I was using the latest 0.26.3.
When I used 0.22.2 I managed to get the same quant as you.
So I traced back where the DeepSeek translations got (significantly) worse: mlx-lm 0.22.3 was still fine, whereas 0.22.4 is not good with translations. (At least for translations with DeepSeek V3-0324.)
Looking at the timeline, it seems to be something that changed between 4 April (0.22.3) and 7 April (0.22.4).
I do not know the inner workings of LLMs like you, so this is not something I can fix on my own.
When I used 0.22.2 I managed to get the same quant as you.
Did you change the MLX version as well or just mlx-lm?
I just changed mlx-lm: pip install mlx-lm==0.22.2
It seems like the difference is due to how we dequantize. I'm not sure if it's systematically worse or if you are just getting unlucky with this one prompt. You can try the diff in this PR and let me know what you think? https://github.com/ml-explore/mlx-lm/pull/376
Thank you, Awni. Respect. You have deep knowledge.
I'll test it with two or three languages and a few pages.
Good morning Awni. I tested it with a few languages, sentences on various topics, etc. The generated output is identical between the deepseek_v3.py of 0.26.3 and the diff you provided.
I created various quants (4-bit, 5-bit, group-size 32, mixed_4_6), but your original 4-bit version is still the best for the translations.
That doesn't jibe with my results. With the current mlx-lm and your generation command from above at 4-bit, I get the following response:
==========
Vertaling van Engels naar Nederlands:
Dit zijn Bijbelstudies over verschillende onderwerpen, geschreven door diverse broeders. Herzien en vertaal deze studies zoals God jou leidt. De studies zijn gepubliceerd om mensen tot hulp te zijn. We beweren zeker niet alles te weten of overal correct te zijn. Als je fouten opmerkt, aanvullende informatie hebt die mensen kan helpen een punt beter te begrijpen, of als je een specifiek onderwerp besproken wilt hebben, laat het ons dan weten.
*(Vrije vertaling met natuurlijk Nederlands taalgebruik, waarbij de toon van nederigheid en uitnodiging tot samenwerking behouden is.)*
==========
And re-quantizing with the change I linked to I get this response:
==========
Hier zijn Bijbelstudies over verschillende onderwerpen, geschreven door diverse broeders. Publiceer en vertaal deze studies opnieuw zoals God u leidt. De studies zijn gepubliceerd om mensen te helpen. We beweren beslist niet alles te weten of overal correct te zijn. Als u fouten opmerkt, als u aanvullende feiten hebt die kunnen helpen een punt beter te begrijpen, of als u een specifiek onderwerp besproken wilt hebben, laat het ons dan weten.
==========
It would be quite surprising if they are identical for you... that should not be the case, and it suggests the change wasn't used properly. You might want to double-check that the two 4-bit quantizations you are making are indeed different.
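One way to check is to compare them tensor-by-tensor. A rough sketch (not anything official, just a quick script; it assumes both quants were saved as model*.safetensors shards, which is what mlx_lm.convert writes, and the directory names are placeholders):

import glob
import mlx.core as mx

def first_difference(dir_a, dir_b):
    # Walk the safetensors shards of two converted models and return the name
    # of the first tensor whose values differ (None if everything matches).
    for path_a in sorted(glob.glob(f"{dir_a}/model*.safetensors")):
        path_b = path_a.replace(dir_a, dir_b, 1)
        weights_a, weights_b = mx.load(path_a), mx.load(path_b)
        for name, tensor in weights_a.items():
            if name not in weights_b or not mx.array_equal(tensor, weights_b[name]).item():
                return name
    return None

# Placeholder directory names for the two 4-bit conversions being compared.
print(first_difference("DeepSeek-V3-0324-4bit-a", "DeepSeek-V3-0324-4bit-b"))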
Yes, I'm doing something wrong. They are NOT different. Same MD5 checksums.
I'm making a 4-bit quantization. Then I replace the models/deepseek_v3.py file with the updated "diff" version which I downloaded from GitHub (RAW file), double-checking that it's the right .py file, and then make the second quantization. I know very little about Python, so I'm probably doing something very stupid. My knowledge of 15 other (older) programming languages is not helping me here...
Please help me with a short description of what I'm supposed to do.
You are building mlx-lm from source, right? Most likely what happened is you don't have an editable installation of mlx-lm. What I would do first is pip uninstall mlx-lm. Then install it in editable mode: pip install -e .
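After that, a quick sanity check is to confirm which copy Python is actually importing, e.g.:

import mlx_lm
from mlx_lm.models import deepseek_v3

# Both paths should point into your mlx-lm source checkout, not into site-packages.
print(mlx_lm.__file__)
print(deepseek_v3.__file__)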
Hi Awni. Thank you, I did that.
I now put a print() statement in deepseek_v3.py in the def __init__ of class Model. This gets called and does show me the message, so I know it calls the correct file.
In def dequant, where you made the changes, I also put a print() statement and even a sys.exit() statement. These never get called. So it never gets to the changes that you made?
Oh, that's odd... what model are you converting from? I am using the original deepseek-ai/DeepSeek-V3-0324
I thought that MLX did not directly create quants from FP8, so I made a BF16. I also tested it on unsloth/DeepSeek-V3-0324-BF16; it is identical to mine. For the last week I used the unsloth version.
So I can just use the original?
Yes, that's precisely what the dequantization step is for -> it loads the fp8 weights and dequantizes them to bf16 before quantizing. What I changed in the code is the method used to compute the dequantization.
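Conceptually the dequantization is just a blockwise scale multiply, along these lines (a sketch, not the actual mlx-lm code; DeepSeek's fp8 checkpoints store one weight_scale_inv entry per 128x128 block of each weight, and here the fp8 tensor is assumed to already be upcast to float32):

import numpy as np

def dequant_blockwise(w_fp8_upcast, scale_inv, block=128):
    # w_fp8_upcast: the fp8 weight already upcast to float32, shape (out_dim, in_dim)
    # scale_inv:    one scale per (block x block) tile,
    #               shape (ceil(out_dim / block), ceil(in_dim / block))
    out_dim, in_dim = w_fp8_upcast.shape
    # Broadcast every per-tile scale over its tile, crop to the weight shape
    # (the trailing tiles can be partial), and multiply.
    scales = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    return w_fp8_upcast * scales[:out_dim, :in_dim]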
Hi Awni
Thank you. Your modified deepseek_v3.py is working well. I tried it with other languages, and on different topics. The changes definitely improved the quality of the translations!
- Will this deepseek_v3.py file work with Kimi K2 as well? It seems to be based on DeepSeek. (And creating a 2 TB file for BF16 is not fun.)
- With your knowledge of LLMs, what are other things I can look at that could also improve the quality of quants for translations? Will something like a group size of 32 help? Is it worth the time to look at AWQ, DWQ, and other options? I saw you have started to add these, but I do not yet know anything about them.
Many thanks for your patient help taking me through this. Much appreciated!
Thank you. Your modified deepseek_v3.py is working well. I tried it with other languages, and on different topics. The changes definitely improved the quality of the translations!
That's interesting that it's better. Maybe we should switch how we dequantize fp8... I'm surprised by that.
Will this deepseek_v3.py file work with Kimi K2 as well?
Yes it should work
With your knowledge of LLMs, what are other things I can look at that could also improve the quality of quants for translations? Will something like a group size of 32 help? Is it worth the time to look at AWQ, DWQ, and other options? I saw you have started to add these, but I do not yet know anything about them.
All of the above really. Higher precision or lower group size will help (e.g. 32 instead of 64).
AWQ and DWQ can definitely help, but they are hard to do for such large models, which is why we don't have them yet. So that requires some work still.
That's interesting that it's better. Maybe we should switch how we dequantize fp8... I'm surprised by that.
In the translations there is a clear difference. It is actually the difference between unacceptable and acceptable quality.
I have no hesitation in recommending that you investigate this.
FYI I ran an eval you can see the results here: https://github.com/ml-explore/mlx-lm/pull/376
The eval suggests minor differences between the two versions, with the version that gives you worse translation quality giving slightly better results. I don't know what to make of your qualitative experience... it might simply be coincidence. Quantization to 4-bit is lossy, and in this case we get unlucky in terms of what we are losing. My recommendation would be to try 4-bit with a smaller group size (32) or even go up a bit in precision. That will be a more stable way of ensuring higher-quality results that are faithful to the original fp8 model.
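For example, something along these lines with the Python convert entry point (a sketch; the q_bits / q_group_size keywords are meant to mirror the CLI's --q-bits / --q-group-size flags, and the output path is just a placeholder):

from mlx_lm import convert

# 4-bit quantization with group size 32 instead of the default 64.
convert(
    "deepseek-ai/DeepSeek-V3-0324",
    mlx_path="DeepSeek-V3-0324-4bit-gs32",
    quantize=True,
    q_bits=4,
    q_group_size=32,
)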