Accuracy of the dynamic quants compared to usual quants?

by inputout - opened

Thanks first of all. I read "https://unsloth.ai/blog/deepseekr1-dynamic", and the dynamic quants are a great development that shows a fantastic ability to retain usable accuracy despite very small quantizations (as demonstrated with the Flappy Bird game).
But I do not understand how the dynamic quants can be classified in terms of accuracy compared to the usual quants, e.g. @bartowski's IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... (https://huggingface.co/bartowski/DeepSeek-R1-GGUF).
ADDED NOTE: By "accuracy" I mean how well the LLM gives correct/intelligent answers (as in benchmarks/leaderboards), not the speed of execution.
I can't find any benchmarks for a real comparison (the leaderboards only show unquantized results).
Can, for example, the dynamic quant Q2_K_XL achieve the accuracy of an IQ4_XS?
If you were to draw up a kind of ranking list, how would you roughly categorize the new quants in comparison to the usual quants?
It would be great to know which dynamic quant corresponds to which usual quant in terms of accuracy. For example:

  • 212GB Q2_K_XL 2.51-bit(MoE) 3.5/2.5bit(Down_proj): corresponds approximately to: IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... ??
  • 183GB IQ2_XXS 2.22-bit(MoE) 2.5/2.06bit(Down_proj): corresponds approximately to: IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... ??

In my understanding, the unsloth dynamic quants use the same quant types as bartowski and others; they just decide per layer and per matrix which type to use, and they decide differently than the default GGUF quantization code.

This graph will give a rough idea: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

Exactly. I know it is very useful for the usual quants, but it does not show where the four unsloth dynamic quants are located.

In my understanding, the unsloth dynamic quants use the same quant types as bartowski and others; they just decide per layer and per matrix which type to use, and they decide differently than the default GGUF quantization code.

Yes, it is a mixture of different bit widths, that much is clear. The blog post provides this information: (https://unsloth.ai/blog/deepseekr1-dynamic)

  • The first 3 dense layers use 0.5% of all weights; leave them at 4 or 6 bit
  • MoE layers use shared experts, using 1.5% of weights; use 6 bit for them
  • MLA attention modules at 4 or 6 bit, using <5% of weights
  • This leaves ~88% of the weights, which can be shrunk massively

1.58-bit 131GB IQ1_S: Range 1.58 to 4/6 bit
1.73-bit 158GB IQ1_M: Range 1.73 to 4/6 bit
2.22-bit 183GB IQ2_XXS: Range 2.22 to 4/6 bit
2.51-bit 212GB Q2_K_XL: Range 2.51 to 4/6 bit

Based on that alone, it is completely unclear where in the large range (1.58/1.73/2.22/2.51 up to 4/6 bit) the dynamic quants stand in terms of accuracy. Judging by the bit counts it doesn't seem possible, because they are mixed.
Therefore it would be great to know approximately which dynamic quant corresponds to which usual quant in terms of accuracy. For me, as a first step, it would even be enough to know Q2_K_XL vs. IQ4_XS (similar accuracy, worse, better?). Hopefully someone with suitable hardware will run benchmarks so that the four unsloth dynamic quants can be roughly placed in context with the usual quants.
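If my reading of the blog is right, the selection is conceptually something like the sketch below (my own hypothetical illustration in Python; the tensor-name patterns and quant types are assumptions, not unsloth's actual code):

```python
# Hypothetical sketch of per-tensor quant-type selection, based on the
# breakdown quoted above. Tensor-name patterns are illustrative only.
def pick_quant_type(tensor_name: str, low_bit_type: str = "IQ1_S") -> str:
    if tensor_name.startswith(("blk.0.", "blk.1.", "blk.2.")):
        return "Q6_K"        # first 3 dense layers kept at 4/6 bit
    if "attn" in tensor_name:
        return "Q6_K"        # MLA attention modules kept at 4/6 bit
    if "shexp" in tensor_name:
        return "Q6_K"        # shared experts in the MoE layers kept at 6 bit
    return low_bit_type      # remaining ~88% (routed experts) get the low-bit type

print(pick_quant_type("blk.10.ffn_down_exps.weight"))   # -> IQ1_S
print(pick_quant_type("blk.10.attn_kv_a_mqa.weight"))   # -> Q6_K
```

So the headline bit count (1.58/1.73/2.22/2.51) would only describe the routed-expert tensors, while attention, shared experts and the first dense layers stay at much higher precision.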

I have an ongoing project to compute perplexity for the 4 low-bit dynamic quant versions and compare them to Q8 or FP8.
It'll take time and money since I don't own hardware that can do that and have to rent a large cloud instance.

I've read that the dynamic Q2s run significantly faster than the dynamic Q1s on the same machine.

It's a bit of a moot point, and I won't be testing it by downloading a Q1, because I can run Q2_K_XL, so why sacrifice quality.

I have replaced "performance" with "accuracy"; it is the clearer term.

By "performance" i mean how well the LLM gives correct/intelligent answers. Sorry maybe i should have worded it more clearly.
So the question was about the "intelligence" of the LLMs (like Arena Score in chatbot-arena-leaderboard) and not the speed of execution.
I am not a native speaker, sorry maybe "performance" was misleading, is there technically unmistakably better word for that??
(Maybe ist "perplexity" or "accuracy" better?, i could modify the headline)

Oh right I understand now! Yes, this is definitely important to work out.

I have an ongoing project to compute perplexity for the 4 low-bit dynamic quant versions and compare them to Q8 or FP8.
It'll take time and money since I don't own hardware that can do that and have to rent a large cloud instance.

I'm not so familiar with this: is the perplexity metric suitable for this type of model, with MoE architecture and reasoning/CoT? On the other hand, getting a deviation from Q8/FP8 is very valuable either way. It becomes particularly interesting when the deviation of the dynamic quants is then compared with the deviation of the usual quants (IQ4_XS, IQ3_M, ...). That would roughly estimate how much "intelligence" remains and where they sit among the usual quants.
I found this: https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/ (But I can't interpret it).

inputout changed discussion title from Performance of the dynamic quants compared to usual quants? to Accuracy of the dynamic quants compared to usual quants?

I have tried to determine the perplexity from IQ4_XS downwards with llama-perplexity.
Of course the tests should only be taken as an indication; more is not possible on my limited system.
I just wanted to find out the deviation between IQ4_XS (bartowski) and the dynamic Q2_K_XL (unsloth).
Starting with IQ4_XS (bartowski), I reduced wiki.test to 1/4 and tested the 7 chunks. The PPL is 2.6108.
Unfortunately, that was it, because the two dynamic unsloth quants (Q2_K_XL and IQ2_XXS) generate [nan] errors. The nan error only occurs with the dynamic quants. Perhaps the dynamic quants cannot (yet) be tested well with llama-perplexity.

Ok it was worth a try :)

They shouldn't have any issue being run with perplexity... that's quite curious.

Does the same happen with my own Q2_K?

They shouldn't have any issue being run with perplexity... that's quite curious.

Does the same happen with my own Q2_K?

@bartowski
I have done a perplexity test, and your Q2_K quant works fine without a nan (as does your IQ4_XS); everything works.
The test was executed with a shortened wiki.test and -c 2048 -b 2048.

Regarding the unsloth dynamic quants, I found out that only -c 512 -b 512 gets past the first chunks without a nan error (perhaps also interesting for @danielhanchen @shimmyshimmer ); a sketch of the sweep follows after the list:
-c 2048 -b 2048: -> nan error (first chunks)
-c 2048 -b 1024: -> nan error (first chunks)
-c 1024 -b 1024: -> nan error (first chunks)
-c 1024 -b 512: -> nan error (first chunks)
-c 512 -b 512: -> works up to chunk 35, then the nan error appears
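A minimal sketch of that sweep, assuming a local llama.cpp build with the llama-perplexity binary; the model and text-file paths are placeholders:

```python
# Sweep the -c/-b settings above and check whether the output contains "nan".
import subprocess

MODEL = "DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf"  # placeholder path
TEXT = "wiki.test.raw"                                # placeholder path

for ctx, batch in [(2048, 2048), (2048, 1024), (1024, 1024), (1024, 512), (512, 512)]:
    result = subprocess.run(
        ["./llama-perplexity", "-m", MODEL, "-f", TEXT, "-c", str(ctx), "-b", str(batch)],
        capture_output=True, text=True,
    )
    status = "nan in output" if "nan" in (result.stdout + result.stderr) else "ok"
    print(f"-c {ctx} -b {batch}: {status}")
```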

Regarding the Reddit post (where someone also tried perplexity tests), maybe it was no coincidence that he also used 512.
At the end he also wrote something about nan errors.
So perhaps the nan errors with the dynamic quants are not just a random event on my end.
(https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/?rdt=40449)

Update: I have now found out that if I use a completely different text file with different content, the first chunks work.
The nan error is only triggered in combination with the file wiki.test.raw (but this file works with other models without any problems).
There must be something in the file that especially triggers these quants. If you look inside, everything is formatted a bit strangely, with some special characters. Does anyone know a better test file?
However, it is a bit strange that the other quants have no problem with the file.

Unsloth AI org

Interesting, thanks for letting us know. I'll notify Daniel.

Interesting, thanks for letting us know. I'll notify Daniel.

@shimmyshimmer @danielhanchen
Thank you. As additional info: I have tested further with other text files, because I thought the error was related to the content of wiki.test.raw and was a kind of refusal. But this has not been confirmed; the nan error also happens with completely different text input.
Depending on the settings, the error starts at a different chunk. But once a nan error has occurred, all subsequent chunks are also nan.
The chunk position from which the nan error occurs can be influenced by the following:

  • changing the text file
  • changing the temp (5 or 7 instead of 6)
  • changing the context size

I have an ongoing project to compute perplexity for the 4 low-bit dynamic quant versions and compare them to Q8 or FP8.
It'll take time and money since I don't own hardware that can do that and have to rent a large cloud instance.

I was now successful with my comparison runs. I used my own text file as the basis for perplexity, so the numbers are not directly comparable to the wiki.test numbers.
Due to storage limitations on my instance, Q5 was the largest I could run.

DeepSeek-R1-UD-IQ1_S [1]3.1155,[2]3.5787,[3]3.2524, Final estimate: PPL = 3.2524 +/- 0.22546
DeepSeek-R1-UD-IQ1_M [1]3.0334,[2]3.5265,[3]3.1973, Final estimate: PPL = 3.1973 +/- 0.21761
DeepSeek-R1-UD-IQ2_XXS [1]3.0212,[2]3.4453,[3]3.1714, Final estimate: PPL = 3.1714 +/- 0.21693
DeepSeek-R1-UD-Q2_K_XL [1]2.9057,[2]3.4079,[3]3.1455, Final estimate: PPL = 3.1455 +/- 0.21094
DeepSeek-R1-Q5_K_M [1]2.8424,[2]3.3452,[3]3.0508, Final estimate: PPL = 3.0508 +/- 0.20548

Assuming Q5 is quite close to FP8, maybe less than 0.001 off the mark, I postulate FP8 to be around 3.0500.

Basically, that's an impressive one-bit improvement at low bitrates.
IQ1 performs like IQ2 so far, and IQ2 is close to IQ3 performance.
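To put the gaps in numbers, here are the differences to Q5_K_M from the final estimates above (quick script over the same numbers):

```python
# PPL final estimates quoted above (own text file, 3 chunks).
ppl = {
    "UD-IQ1_S":   3.2524,
    "UD-IQ1_M":   3.1973,
    "UD-IQ2_XXS": 3.1714,
    "UD-Q2_K_XL": 3.1455,
    "Q5_K_M":     3.0508,
}
base = ppl["Q5_K_M"]
for name, value in ppl.items():
    print(f"{name:<11} PPL {value:.4f}  (+{value - base:.4f} vs Q5_K_M)")
```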

@TobDeBer
Thanks for the results! That's very interesting.
I'm currently doing a few tests with -c 1024 and also my own text file (-> 40 chunks); it all takes a long time with this large model.
At first glance, the deviations in your tests look a little smaller than in mine (Q5_K_M vs. the dynamic quants).
Have you also tested lower non-dynamic quants? At the moment, in my tests, it looks as if Q2_K is very similar to UD-IQ2_XXS and UD-Q2_K_XL, and UD-IQ2_XXS and UD-Q2_K_XL are also very similar to each other. When my tests are properly completed at some point, I will show this in more detail...
I'm still struggling with the nan errors, so I'll evaluate it more graphically using the chunk progression up to the point where the nan stops the progression (which varies depending on the model and temp). With nan there is no PPL calculation.

Interesting, thanks for letting us know. I'll notify Daniel.

Does the same happen with my own Q2_K?

@shimmyshimmer @danielhanchen
I have to correct my statement: there has now also been a nan error with a quant from @bartowski (Q4_K_S).
This means it doesn't just affect the unsloth dynamic quants, which is what it looked like at first.
I do not know the reason for the errors in some perplexity calculations, but so far they have only occurred for me in the context of DeepSeek R1.

A technical question: can the DeepSeek R1 perplexities of the unsloth and bartowski GGUFs be compared with each other?
(You wrote "We leveraged Bartowski's importance matrix for the lower quants." Does that make them comparable?)

I don't think they would even need the same importance matrix.

I don't love perplexity as a measurement of quality between models, but it should be mostly reasonable here since they're the same source model.
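For anyone following along: perplexity here is just the exponentiated average negative log-likelihood of the test text,

$$\mathrm{PPL} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\Big),$$

so lower is better, and quantization damage shows up as how much it rises over the Q8/FP8 baseline rather than as an absolute number.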
