---
license: mit
library_name: mlx-lm
tags:
- mlx
- apple-silicon
- quantized
- moe
- text-generation
base_model: zai-org/GLM-4.5
model_type: glm
language:
- en
- zh
pipeline_tag: text-generation
---
# GLM-4.5 MLX 8-bit
## Model Description
This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), optimized for Apple Silicon with high unified memory configurations.
## Key Features
- **8-bit quantization** (8.502 bits per weight) for memory efficiency
- **MLX optimized** for Apple Silicon unified memory architecture
- **High-memory optimized**: Designed for systems with 512GB+ unified memory
- **Long context capable**: Tested with multiple 6,500+ word documents and 30K-token input chunks
- **Performance**: ~11.75 tokens/second on a Mac Studio with 512GB unified memory
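The reported model size follows directly from the bits-per-weight figure. A quick sanity check, assuming GLM-4.5's roughly 355B total parameters (treat the exact count as approximate):

```python
# Estimate on-disk size from the effective quantization rate.
params = 355e9           # approximate total parameter count of GLM-4.5 (MoE)
bits_per_weight = 8.502  # effective rate, including quantization metadata
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # ~377 GB, in line with the ~375GB listed below
```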
## Model Details
- **Base Model**: GLM-4.5 by ZhipuAI
- **Architecture**: MoE (Mixture of Experts)
- **Quantization**: 8-bit MLX with group size 64
- **MLX-LM Version**: 0.26.3
- **Model Size**: ~375GB
- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)
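For anyone reproducing the conversion, mlx-lm exposes a `convert` helper. The exact invocation used for this upload is not recorded, so the following is a sketch with the quantization settings listed above; the output path is illustrative:

```python
from mlx_lm import convert

# Quantize GLM-4.5 to 8-bit with group size 64, matching the settings above.
# Requires disk space for both the source weights and the converted output.
convert(
    "zai-org/GLM-4.5",
    mlx_path="GLM-4.5-MLX-8bit",
    quantize=True,
    q_bits=8,
    q_group_size=64,
)
```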
## System Requirements
- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (tested on an M3 Ultra)
- **Memory**: 512GB+ unified memory strongly recommended
- **Storage**: ~400GB free space
- **Software**: macOS with MLX framework
## Performance Benchmarks
**Test Configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory
### Context Length Performance
- **Short Context (6.5K tokens)**: 11.75 tokens/second
- **Long Context (72K tokens)**: 5.0 tokens/second, 86% memory usage
- **Extended Context (121K tokens)**: 30K-token input prompt, 2.53 tokens/second, 92% memory usage
- **Beyond Stated Limit (132K tokens)**: 11K-token input prompt, 5.74 tokens/second, 85% peak memory
- **Proven Capability**: Runs successfully past the stated 131K context window (102.2% of capacity)
- **Quality**: Coherent comprehension and analysis of complex, sprawling content at maximum context
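These throughput numbers can be reproduced with mlx-lm's built-in reporting: `generate` prints prompt and generation tokens/second, plus peak memory, when `verbose=True`. A minimal sketch; substitute your own long-context prompt:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# verbose=True prints prompt/generation tokens-per-second and peak memory.
generate(
    model,
    tokenizer,
    prompt="Paste a long test document here...",
    max_tokens=500,
    verbose=True,
)
```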
### Recommended Generation Settings
- **Temperature**: 0.8
- **Top K**: 100
- **Repeat Penalty**: 1.1
- **Min P**: Default/unset
- **Top P**: Default/unset
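With mlx-lm, these settings map onto the sampler and logits-processor helpers rather than direct `generate` arguments. A sketch, assuming mlx-lm 0.26.x:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# Temperature 0.8 and top-k 100; top-p and min-p are left at their defaults.
sampler = make_sampler(temp=0.8, top_k=100)
# Repeat penalty 1.1.
logits_processors = make_logits_processors(repetition_penalty=1.1)

response = generate(
    model,
    tokenizer,
    "Your prompt here",
    max_tokens=500,
    sampler=sampler,
    logits_processors=logits_processors,
)
```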
### Comparison with GGUF
- **MLX Version**: System remains responsive during inference, stable performance
- **GGUF Version**: System becomes unresponsive during inference, with frequent crashes around 30-40K tokens of context
## Usage
### With MLX-LM
```python
from mlx_lm import load, generate

# Downloading and loading ~375GB of weights takes a while on the first run.
model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")
response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
print(response)
```
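GLM-4.5 is an instruction-tuned chat model, so chat-style prompts should go through the tokenizer's chat template. A sketch continuing from the snippet above:

```python
# Wrap the user message in the model's chat template before generating.
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt, max_tokens=500)
```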
### With LM Studio
1. Download the model files
2. Load in LM Studio
3. Set appropriate context length based on your memory
4. Use the generation settings listed above (temperature 0.8, top-k 100, repeat penalty 1.1)
## Limitations
- Requires substantial unified memory (512GB+ recommended)
- Packaged for MLX, which is designed for Apple Silicon; other platforms are unsupported or untested
- Quantization may introduce minor quality differences compared to the full-precision model
## Training Data & Bias
Please refer to the original [GLM-4.5 model card](https://huggingface.co/zai-org/GLM-4.5) for information about training data, intended use, and potential biases.
## Citation
If you use this model, please cite both the original GLM-4.5 work and acknowledge this MLX conversion:
```bibtex
@misc{glm45-mlx-8bit,
  title        = {GLM-4.5 MLX 8-bit},
  author       = {Onceler},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}}
}
```
## Acknowledgments
- Original model by ZhipuAI (zai-org/GLM-4.5)
- MLX framework by Apple
- Conversion performed on Mac Studio with 512GB unified memory
## License
This model inherits the license from the original GLM-4.5 model. Please refer to the original model repository for license details.