---
license: mit
library_name: mlx-lm
tags:
- mlx
- apple-silicon
- quantized
- moe
- text-generation
base_model: zai-org/GLM-4.5
model_type: glm
language:
- en
- zh
pipeline_tag: text-generation
---
# GLM-4.5 MLX 8-bit
## Model Description
This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), optimized for Apple Silicon with high unified memory configurations.
## Key Features
- **8-bit quantization** (8.502 bits per weight) for memory efficiency
- **MLX optimized** for Apple Silicon unified memory architecture
- **High-memory optimized**: Designed for systems with 512GB+ unified memory
- **Long context capable**: Tested with multiple 6,500+ word documents and 30K-token input chunks
- **Performance**: ~11.75 tokens/second on a Mac Studio with 512GB unified memory
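The reported model size follows directly from the bits-per-weight figure. A quick sanity check, assuming GLM-4.5's roughly 355B total parameters (treat the exact count as approximate):

```python
# Estimate on-disk size from the effective quantization rate.
params = 355e9           # approximate total parameter count of GLM-4.5 (MoE)
bits_per_weight = 8.502  # effective rate, including quantization metadata
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # ~377 GB, in line with the ~375GB listed below
```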
## Model Details
- **Base Model**: GLM-4.5 by ZhipuAI
- **Architecture**: MoE (Mixture of Experts)
- **Quantization**: 8-bit MLX with group size 64
- **MLX-LM Version**: 0.26.3
- **Model Size**: ~375GB
- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)
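For anyone reproducing the conversion, mlx-lm exposes a `convert` helper. The exact invocation used for this upload is not recorded, so the following is a sketch with the quantization settings listed above; the output path is illustrative:

```python
from mlx_lm import convert

# Quantize GLM-4.5 to 8-bit with group size 64, matching the settings above.
# Requires disk space for both the source weights and the converted output.
convert(
    "zai-org/GLM-4.5",
    mlx_path="GLM-4.5-MLX-8bit",
    quantize=True,
    q_bits=8,
    q_group_size=64,
)
```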
## System Requirements
- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (tested on an M3 Ultra)
- **Memory**: 512GB+ unified memory strongly recommended
- **Storage**: ~400GB free space
- **Software**: macOS with MLX framework
## Performance Benchmarks
**Test Configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory
### Context Length Performance
- **Short Context (6.5K tokens)**: 11.75 tokens/second
- **Long Context (72K tokens)**: 5.0 tokens/second, 86% memory usage
- **Extended Context (121K tokens)**: 30K-token input prompt, 2.53 tokens/second, 92% memory usage
- **Beyond Stated Limit (132K tokens)**: 11K-token input prompt, 5.74 tokens/second, 85% peak memory
- **Proven Capability**: Runs successfully past the stated 131K context window (102.2% of capacity)
- **Quality**: Coherent comprehension and analysis of complex, sprawling content at maximum context
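These throughput numbers can be reproduced with mlx-lm's built-in reporting: `generate` prints prompt and generation tokens/second, plus peak memory, when `verbose=True`. A minimal sketch; substitute your own long-context prompt:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# verbose=True prints prompt/generation tokens-per-second and peak memory.
generate(
    model,
    tokenizer,
    prompt="Paste a long test document here...",
    max_tokens=500,
    verbose=True,
)
```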
### Recommended Generation Settings
- **Temperature**: 0.8
- **Top K**: 100
- **Repeat Penalty**: 1.1
- **Min P**: Default/unset
- **Top P**: Default/unset
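With mlx-lm, these settings map onto the sampler and logits-processor helpers rather than direct `generate` arguments. A sketch, assuming mlx-lm 0.26.x:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# Temperature 0.8 and top-k 100; top-p and min-p are left at their defaults.
sampler = make_sampler(temp=0.8, top_k=100)
# Repeat penalty 1.1.
logits_processors = make_logits_processors(repetition_penalty=1.1)

response = generate(
    model,
    tokenizer,
    "Your prompt here",
    max_tokens=500,
    sampler=sampler,
    logits_processors=logits_processors,
)
```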
### Comparison with GGUF
- **MLX Version**: System remains responsive during inference, stable performance
- **GGUF Version**: System becomes unresponsive during inference, with frequent crashes around 30-40K tokens of context
## Usage
### With MLX-LM
```python
from mlx_lm import load, generate

# Downloading and loading ~375GB of weights takes a while on the first run.
model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")
response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
print(response)
```
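GLM-4.5 is an instruction-tuned chat model, so chat-style prompts should go through the tokenizer's chat template. A sketch continuing from the snippet above:

```python
# Wrap the user message in the model's chat template before generating.
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt, max_tokens=500)
```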
### With LM Studio
1. Download the model files
2. Load in LM Studio
3. Set appropriate context length based on your memory
4. Use the generation settings listed above (temperature 0.8, top-k 100, repeat penalty 1.1)
## Limitations
- Requires substantial unified memory (512GB+ recommended)
- Packaged for MLX, which is designed for Apple Silicon; other platforms are unsupported or untested
- Quantization may introduce minor quality differences compared to the full-precision model
## Training Data & Bias
Please refer to the original [GLM-4.5 model card](https://huggingface.co/zai-org/GLM-4.5) for information about training data, intended use, and potential biases.
## Citation
If you use this model, please cite both the original GLM-4.5 work and acknowledge this MLX conversion:
```bibtex
@misc{glm45-mlx-8bit,
  title        = {GLM-4.5 MLX 8-bit},
  author       = {Onceler},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}}
}
```
## Acknowledgments
- Original model by ZhipuAI (zai-org/GLM-4.5)
- MLX framework by Apple
- Conversion performed on Mac Studio with 512GB unified memory
## License
This model inherits the license from the original GLM-4.5 model. Please refer to the original model repository for license details.