---
license: mit
library_name: mlx-lm
tags:
- mlx
- apple-silicon
- quantized
- moe
- text-generation
base_model: zai-org/GLM-4.5
model_type: glm
language:
- en
- zh
pipeline_tag: text-generation
---

# GLM-4.5 MLX 8-bit

## Model Description

This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), optimized for Apple Silicon systems with large unified memory configurations.

## Key Features

- **8-bit quantization** (8.502 bits per weight) for memory efficiency
- **MLX optimized** for Apple Silicon's unified memory architecture
- **High-memory optimized**: designed for systems with 512GB+ unified memory
- **Long-context capable**: tested with multiple 6,500+ word documents and 30K-token chunks
- **Performance**: ~11.75 tokens/second on a Mac Studio with 512GB RAM

## Model Details

- **Base Model**: GLM-4.5 by ZhipuAI
- **Architecture**: Mixture of Experts (MoE)
- **Quantization**: 8-bit MLX with group size 64 (see "Reproducing the Quantization" below)
- **MLX-LM Version**: 0.26.3
- **Model Size**: ~375GB
- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)

## System Requirements

- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (tested on M3 Ultra)
- **Memory**: 512GB+ unified memory strongly recommended
- **Storage**: ~400GB of free space
- **Software**: macOS with the MLX framework

## Performance Benchmarks

**Test configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory (see "Reproducing the Benchmarks" below)

### Context Length Performance

| Scenario | Total context | Input prompt | Speed | Peak memory |
|----------|---------------|--------------|-------|-------------|
| Short context | 6.5K tokens | – | 11.75 tok/s | – |
| Long context | 72K tokens | – | 5.0 tok/s | 86% |
| Extended context | 121K tokens | 30K tokens | 2.53 tok/s | 92% |
| Beyond stated limit | 132K tokens | 11K tokens | 5.74 tok/s | 85% |

- **Beyond the stated limit**: generation succeeded past the published 131K context window (102.2% of capacity)
- **Quality**: full comprehension and analysis of complex, sprawling content at maximum context length

### Recommended Generation Settings

- **Temperature**: 0.8
- **Top K**: 100
- **Repeat Penalty**: 1.1
- **Min P**: default/unset
- **Top P**: default/unset

A Python sketch that applies these settings is given under "Applying the Recommended Settings" below.

### Comparison with GGUF

- **MLX version**: the system remains responsive during inference, with stable performance
- **GGUF version**: the system becomes unusable, with frequent crashes around 30-40K tokens of context

## Usage

### With MLX-LM

```python
from mlx_lm import load, generate

# Loads the quantized weights into unified memory (downloads on first use)
model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
print(response)
```

### With LM Studio

1. Download the model files
2. Load the model in LM Studio
3. Set the context length according to your available memory
4. Apply the generation settings listed under "Recommended Generation Settings" above

## Limitations

- Requires substantial unified memory (512GB+ recommended)
- Optimized specifically for Apple Silicon; may not perform well on other architectures
- Quantization may introduce minor quality differences compared to the full-precision model

## Training Data & Bias

Please refer to the original [GLM-4.5 model card](https://huggingface.co/zai-org/GLM-4.5) for information about training data, intended use, and potential biases.
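## Reproducing the Quantization

The exact conversion command is not recorded in this card. As a minimal sketch, an equivalent 8-bit, group-size-64 quantization could be produced with mlx-lm's `convert` command; the flags below are from the mlx-lm 0.26.x CLI, and the output path is illustrative:

```bash
# Quantize the original weights to 8-bit MLX with group size 64.
# Assumes mlx-lm is installed (pip install mlx-lm) and substantial
# free disk space for both the original and converted weights.
mlx_lm.convert \
    --hf-path zai-org/GLM-4.5 \
    --mlx-path ./GLM-4.5-MLX-8bit \
    -q \
    --q-bits 8 \
    --q-group-size 64
```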
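## Reproducing the Benchmarks

The throughput and memory figures above can be reproduced informally: as of mlx-lm 0.26.x, passing `verbose=True` to `generate` prints prompt and generation tokens-per-second along with peak memory. The prompt file below is illustrative:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# A long document to exercise the context window; substitute your own.
with open("long_document.txt") as f:
    prompt = f.read()

# verbose=True prints prompt/generation tokens-per-second and peak memory.
generate(model, tokenizer, prompt, max_tokens=500, verbose=True)
```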
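## Applying the Recommended Settings

The sampler values under "Recommended Generation Settings" can be applied programmatically. This is a minimal sketch using the helpers in `mlx_lm.sample_utils` (names per mlx-lm 0.26.x); the example prompt is illustrative:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# Temperature 0.8 and top-k 100; top-p and min-p stay at their defaults.
sampler = make_sampler(temp=0.8, top_k=100)

# Repeat penalty of 1.1, applied over the recent context.
logits_processors = make_logits_processors(repetition_penalty=1.1)

# Chat-style prompting via the tokenizer's chat template.
messages = [{"role": "user", "content": "Summarize this document."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=500,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```

The temperature and top-k settings are also exposed as flags on the command-line generator (again assuming the 0.26.x flag names):

```bash
mlx_lm.generate --model mlx-community/GLM-4.5-MLX-8bit \
    --prompt "Your prompt here" --max-tokens 500 --temp 0.8 --top-k 100
```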
## Citation

If you use this model, please cite the original GLM-4.5 work and acknowledge this MLX conversion:

```bibtex
@misc{glm45-mlx-8bit,
  title={GLM-4.5 MLX 8-bit},
  author={Onceler},
  year={2025},
  howpublished={\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}},
}
```

## Acknowledgments

- Original model by ZhipuAI ([zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5))
- MLX framework by Apple
- Conversion performed on a Mac Studio with 512GB unified memory

## License

This model inherits the MIT license of the original GLM-4.5 model. Please refer to the original model repository for license details.