mlx-community
/

GLM-4.5-MLX-8bit

+---
+license: mit
+library_name: mlx-lm
+tags:
+- mlx
+- apple-silicon
+- quantized
+- moe
+- text-generation
+base_model: zai-org/GLM-4.5
+model_type: glm
+language:
+- en
+- zh
+pipeline_tag: text-generation
+---
+# GLM-4.5 MLX 8-bit
+## Model Description
+This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), optimized for Apple Silicon with high unified memory configurations.
+## Key Features
+- **8-bit quantization** (8.502 bits per weight) for memory efficiency
+- **MLX optimized** for Apple Silicon unified memory architecture
+- **High-memory optimized**: Designed for systems with 512GB+ unified memory
+- **Long context capable**: Tested with 6,500+ word documents
+- **Performance**: ~11.75 tokens/second on Mac Studio with 512GB RAM
+## Model Details
+- **Base Model**: GLM-4.5 by ZhipuAI
+- **Architecture**: MoE (Mixture of Experts)
+- **Quantization**: 8-bit MLX with group size 64
+- **MLX-LM Version**: 0.26.3
+- **Model Size**: ~375GB
+- **Context Length**: 131,072 tokens (tested stable up to 72K+ tokens)
+## System Requirements
+- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (M1/M2/M3 series)
+- **Memory**: 512GB+ unified memory strongly recommended
+- **Storage**: ~400GB free space
+- **Software**: macOS with MLX framework
+## Performance Benchmarks
+**Test Configuration**: Mac Studio with 512GB unified memory
+### Context Length Performance
+- **Short Context (6.5K tokens)**: 11.75 tokens/second
+- **Long Context (72K tokens)**: 5.0 tokens/second, 86% memory usage
+- **Extended Context (121K tokens)**: 2.53 tokens/second, 92% memory usage
+- **Beyond Theoretical Limit (132K tokens)**: 5.74 tokens/second, 85% peak memory
+- **Proven Capability**: Successfully exceeds stated 131K context window (102.2% capacity)
+- **Quality**: Full comprehension and analysis of complex, sprawling content at maximum context
+### Recommended Generation Settings
+- **Temperature**: 0.8
+- **Top K**: 100
+- **Repeat Penalty**: 1.1
+- **Min P**: Default/unset
+- **Top P**: Default/unset
+### Comparison with GGUF
+- **MLX Version**: System remains responsive during inference, stable performance
+- **GGUF Version**: System becomes unusable, frequent crashes around 30-40K tokens
+## Usage
+### With MLX-LM
+```python
+from mlx_lm import load, generate
+model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")
+response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
+```
+### With LM Studio
+1. Download the model files
+2. Load in LM Studio
+3. Set appropriate context length based on your memory
+4. Recommended settings: [Add any specific settings you found worked well]
+## Limitations
+- Requires substantial unified memory (512GB+ recommended)
+- Optimized specifically for Apple Silicon; may not perform well on other architectures
+- Quantization may introduce minor quality differences compared to the full-precision model
+## Training Data & Bias
+Please refer to the original [GLM-4.5 model card](https://huggingface.co/zai-org/GLM-4.5) for information about training data, intended use, and potential biases.
+## Citation
+If you use this model, please cite both the original GLM-4.5 work and acknowledge this MLX conversion:
+```bibtex
+@misc{glm45-mlx-8bit,
+  title={GLM-4.5 MLX 8-bit},
+  author={Onceler},
+  year={2025},
+  howpublished={\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}},
+}
+```
+## Acknowledgments
+- Original model by ZhipuAI (zai-org/GLM-4.5)
+- MLX framework by Apple
+- Conversion performed on Mac Studio with 512GB unified memory
+## License
+This model inherits the license from the original GLM-4.5 model. Please refer to the original model repository for license details.