inferencerlabs/Kimi-K2-Instruct-MLX-3.985bit


	---
	license: other
	license_name: modified-mit
	library_name: mlx
	base_model: moonshotai/Kimi-K2-Instruct
	pipeline_tag: text-generation
	tags:
	- mlx
	---
	See Kimi-K2 Dynamic MLX in action - [https://youtu.be/-zfUvA2CDqE](https://youtu.be/-zfUvA2CDqE)

	The q3.95-bit dynamic quant reaches a perplexity of 1.243 in our testing, which places it much closer to q4 (1.168) than to q3 (1.900).
	| Quantization | Perplexity |
	|:------------:|:----------:|
	| q2 | 41.293 |
	| q3 | 1.900 |
	| q3.95 | 1.243 |
	| q4 | 1.168 |
	| q6 | 1.128 |
	| q8 | 1.128 |
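
To make the "closer to q4 than q3" claim concrete, the sketch below recomputes the gaps from the table values. The dictionary and the helper name `position_between` are illustrative only and not part of any released tooling; the numbers are copied from the table as-is.

```python
# Perplexity values copied from the table above.
perplexity = {
    "q2": 41.293,
    "q3": 1.900,
    "q3.95": 1.243,
    "q4": 1.168,
    "q6": 1.128,
    "q8": 1.128,
}


def position_between(low_quant, mid_quant, high_quant, table=perplexity):
    """Fraction of the low->high perplexity improvement already reached by mid.

    Hypothetical helper for illustration: 0.0 means "no better than low_quant",
    1.0 means "as good as high_quant".
    """
    gap_total = table[low_quant] - table[high_quant]
    gap_closed = table[low_quant] - table[mid_quant]
    return gap_closed / gap_total


gap_to_q4 = perplexity["q3.95"] - perplexity["q4"]  # distance to the 4-bit result
gap_to_q3 = perplexity["q3"] - perplexity["q3.95"]  # distance to the 3-bit result
fraction = position_between("q3", "q3.95", "q4")

print(f"gap to q4: {gap_to_q4:.3f}")
print(f"gap to q3: {gap_to_q3:.3f}")
print(f"share of the q3->q4 improvement recovered: {fraction:.1%}")
```

On these figures, the q3.95 quant sits 0.075 above q4 and 0.657 below q3, so it recovers roughly 90% of the perplexity improvement between the two.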

	## Usage Notes

	* Built with a modified version of [MLX](https://github.com/ml-explore/mlx) 0.26
	* Runs on a single M3 Ultra with 512 GB of RAM
	* Requires raising the VRAM (GPU wired memory) limit to at least ~500000 MB
	* The command below sets the limit to 507000 MB to leave room for a larger context window:
	* `sudo sysctl iogpu.wired_limit_mb=507000`
	* Expect ~20 tokens/s
	* For more details, see the [demonstration video](https://youtu.be/-zfUvA2CDqE) or visit the [Kimi K2](https://moonshotai.github.io/Kimi-K2/) page.
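
The arithmetic behind these notes can be sketched as follows. This is a rough illustration, not a tuning tool: it assumes "512GB" means 512 × 1024 MB, the function names are hypothetical, and the ~20 tokens/s figure is taken from the notes above as a flat average. The script only builds the `sysctl` command string; it does not run it.

```python
MB_PER_GB = 1024  # assumption: "512GB" of RAM means 512 * 1024 MB


def wired_limit_headroom_mb(total_ram_gb, wired_limit_mb):
    """MB left outside the GPU wired limit for macOS and other processes."""
    return total_ram_gb * MB_PER_GB - wired_limit_mb


def sysctl_command(wired_limit_mb):
    """Build (but do not execute) the command from the usage notes."""
    return f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb}"


def generation_seconds(num_tokens, tokens_per_second=20.0):
    """Rough generation time, treating ~20 tokens/s as a constant rate."""
    return num_tokens / tokens_per_second


for limit_mb in (500_000, 507_000):
    headroom = wired_limit_headroom_mb(512, limit_mb)
    print(f"limit {limit_mb} MB -> {headroom} MB left for the system")

print(sysctl_command(507_000))
print(f"1000 tokens at ~20 tokens/s: about {generation_seconds(1000):.0f} s")
```

Under these assumptions, a 507000 MB limit leaves about 17 GB of the 512 GB for everything else on the machine, which is why the limit cannot be pushed much higher.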