Awful loss of coherence between MLX 2bit and IQ2_XXS 🙀
Hi! Thank you for taking the time to make and share your MLX 2bit quant. 🥰👍
I'm using the Unsloth IQ2_XXS GGUF and it works amazingly well, if a bit slow.
On the other hand, your MLX 2bit version (and another one I quantized myself) is awfully incoherent by comparison (using the same settings).
Example prompt, where the model confuses a tongue 👅 with a language 🗣️ and rambles like a drunk man 🥴
👤 How would you describe a cat's tongue?
🤖 (MLX 2bit) The first thing that comes to mind for those who are from countries where onomatopoeia is used frequently and cats are common, would likely be "meow" or "purr". The word meow is a familiar sound and it seems to be the only thing that comes to mind when thinking of cats. However, there are other words that can be used to describe a cat's tongue.
The word meow is often used as an interjection and it can be difficult to tell if the speaker means something else. The word meow has been used in many different contexts and it can be difficult to know which meaning is being referred to. The word meow has been used in many different contexts and it can be difficult to know which meaning is being referred to.
(LM Studio: Temp 0.06, MinP 0.05, TopP 0.92)
Same experience.
Chatted with Nightmedia about it, and he says it comes down to the difference in the options GGUF and MLX allow for fine-grained mixed quants. GGUF allows more freedom and, I think, a smaller page size.
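For the curious: the GGUF side is typically built with an importance matrix, which lets llama.cpp vary how hard individual tensors get squeezed, while the MLX converter applies one bit width and group size across the whole model. Roughly like this (command from memory, so double-check against llama-quantize --help; file names are placeholders):
# imatrix-guided IQ2_XXS quant in llama.cpp
./llama-quantize --imatrix glm-4.5.imatrix GLM-4.5-F16.gguf GLM-4.5-IQ2_XXS.gguf IQ2_XXS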
Unsloth's IQ2_XXS is working well for me: about 13 t/s with 91/93 layers offloaded (full offload drops to ≤1 t/s), flash attention on, 24k context.
Can push it to 32k with the V cache at fp16 and no K-cache quant; it holds together fairly well.
sudo sysctl iogpu.wired_limit_mb=120826
On an M3 128GB.
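That sysctl raises macOS's GPU wired-memory limit (value in MB) so more of the model can stay resident on the GPU; it resets on reboot, and you can read the current value back with:
sysctl iogpu.wired_limit_mb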
Thanks @L285348120816 for sharing this feedback (and your speed benchmark).
I ran Unsloth's IQ2_XXS at 17 t/s on an M4 Max 128GB via KoboldCpp (full offload, ContextShift, flash attention, MMAP, mlock, FastForwarding, 8-bit KV cache), with 16384 context and sysctl iogpu.wired_limit_mb=120000.
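In case it helps anyone reproduce, the launch was roughly equivalent to this (flag names from memory, so double-check against koboldcpp --help; the GGUF filename is a placeholder):
python koboldcpp.py GLM-4.5-IQ2_XXS.gguf --contextsize 16384 --gpulayers 999 --flashattention --quantkv 1 --usemlock
ContextShift, FastForwarding, and mmap should already be on by default, if I remember right.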
I know the user deleted their account, but an update: Unsloth updated their GLM 4.5 HF repo, so it can now be run through MLX quantization more easily. (Thanks @shimmyshimmer and @NeuralNovel )
I'll convert the Unsloth 4.5 to 2bit and 3bit MLX, and maybe those will run better.
(edit) Both are now converted, using Unsloth as the base model for the GLM 4.5 2bit and 3bit.
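For reference, a 2-bit/3-bit conversion from the Unsloth base with mlx_lm.convert looks roughly like this (treat the repo id and group size as placeholders):
# 2-bit and 3-bit MLX quants from the same HF base
python -m mlx_lm.convert --hf-path unsloth/GLM-4.5 -q --q-bits 2 --q-group-size 64
python -m mlx_lm.convert --hf-path unsloth/GLM-4.5 -q --q-bits 3 --q-group-size 64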
Sadly, no visible improvement for me. 😓
Bummer. I was able to get it to behave a little better by modifying sampler settings, but it's not as good as DeepSeek for me. You're better off with the 3-bit version than the 2-bit.
Thank you for your efforts. I tested the new 2-bit version in LM Studio, but unfortunately I have to say that it does not deliver usable results.
MLX quants might just not be flexible enough at around the 110GB size. I'll look into other quant methods more and see if any work well.
Until then, at around 110GB you can run GLM 4.5 Air at 8-bit (113GB) or GLM 4.5V at 8-bit (115GB). My personal favorite lately is any fine-tune or quant of Wayfarer Large, which is much smaller at 8-bit (75GB).