---
license: mit
library_name: mlx-lm
tags:
- mlx
- apple-silicon
- quantized
- moe
- text-generation
base_model: zai-org/GLM-4.5
model_type: glm
language:
- en
- zh
pipeline_tag: text-generation
---
|
|
|
# GLM-4.5 MLX 8-bit
|
|
|
|
## Model Description

This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), optimized for Apple Silicon systems with high unified-memory configurations.
|
|
|
## Key Features

- **8-bit quantization** (8.502 bits per weight) for memory efficiency
- **MLX optimized** for Apple Silicon's unified memory architecture
- **High-memory optimized**: Designed for systems with 512GB+ unified memory
- **Long context capable**: Tested with multiple 6,500+ word documents and 30K-token chunks
- **Performance**: ~11.75 tokens/second on a Mac Studio with 512GB unified memory
|
|
|
## Model Details

- **Base Model**: GLM-4.5 by ZhipuAI
- **Architecture**: MoE (Mixture of Experts)
- **Quantization**: 8-bit MLX with group size 64
- **MLX-LM Version**: 0.26.3
- **Model Size**: ~375GB
- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)
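The ~375GB size is consistent with a quick back-of-envelope check from the bits-per-weight figure. The sketch below assumes a total parameter count of roughly 355B for GLM-4.5 (an assumption based on the upstream release; adjust if the actual count differs):

```python
# Back-of-envelope size check: parameters x effective bits per weight.
# 355e9 total parameters is an ASSUMPTION about the upstream GLM-4.5 release,
# not a figure taken from this card.
total_params = 355e9
bits_per_weight = 8.502  # effective rate reported above (includes quantization overhead)

size_bytes = total_params * bits_per_weight / 8
size_gb = size_bytes / 1e9  # decimal gigabytes
print(f"~{size_gb:.0f} GB")  # lands close to the ~375GB listed above
```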
|
|
|
## System Requirements

- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (e.g. M3 Ultra)
- **Memory**: 512GB+ unified memory strongly recommended
- **Storage**: ~400GB free space
- **Software**: macOS with the MLX framework installed
|
|
|
## Performance Benchmarks

**Test Configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory

### Context Length Performance

- **Short context (6.5K tokens)**: 11.75 tokens/second
- **Long context (72K tokens)**: 5.0 tokens/second, 86% memory usage
- **Extended context (121K tokens)**: 30K-token input prompt, 2.53 tokens/second, 92% memory usage
- **Beyond stated limit (132K tokens)**: 11K-token input prompt, 5.74 tokens/second, 85% peak memory
- **Proven capability**: Successfully exceeds the stated 131K context window (102.2% of capacity)
- **Quality**: Full comprehension and analysis of complex, sprawling content at maximum context
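For planning interactive use, the decode rates above translate directly into wall-clock latency. A small sketch using the measured rates (prompt-processing time excluded, so real latency will be somewhat higher):

```python
# Wall-clock estimates for a reply at the decode rates measured above.
rates = {
    "6.5K-token context": 11.75,  # tokens/second
    "72K-token context": 5.0,
    "121K-token context": 2.53,
}

def seconds_for(tokens: int, rate: float) -> float:
    """Time to decode `tokens` output tokens at `rate` tok/s (prompt processing excluded)."""
    return tokens / rate

for label, rate in rates.items():
    print(f"{label}: {seconds_for(500, rate):.0f} s for a 500-token reply")
```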
|
|
|
### Recommended Generation Settings

- **Temperature**: 0.8
- **Top K**: 100
- **Repeat Penalty**: 1.1
- **Min P**: Default/unset
- **Top P**: Default/unset
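To make these settings concrete, here is an illustrative pure-Python re-implementation of what they do to a logits vector at each decoding step. This is a sketch of the standard sampling transforms, not mlx-lm's internal code:

```python
import math

def apply_settings(logits, generated_ids, temperature=0.8, top_k=100, repeat_penalty=1.1):
    """Illustrative sketch of the settings above (NOT mlx-lm internals).

    Applies a repetition penalty, temperature scaling, and top-k filtering,
    then returns a probability distribution over token ids."""
    scaled = list(logits)
    # Repetition penalty: push down tokens that already appeared.
    for tok in set(generated_ids):
        scaled[tok] = scaled[tok] / repeat_penalty if scaled[tok] > 0 else scaled[tok] * repeat_penalty
    # Temperature < 1.0 sharpens the distribution.
    scaled = [x / temperature for x in scaled]
    # Top-k: keep only the k highest-scoring tokens.
    cutoff = sorted(scaled, reverse=True)[:top_k][-1]
    scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    # Softmax over the survivors.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens; token 0 was already generated, top_k trimmed to 2.
probs = apply_settings([2.0, 1.0, 0.5, -1.0], generated_ids=[0], top_k=2)
```

Min P and Top P are left at their defaults per the settings above, so they are omitted from the sketch.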
|
|
|
### Comparison with GGUF

- **MLX version**: System remains responsive during inference; performance stays stable
- **GGUF version**: System becomes unusable, with frequent crashes around 30-40K tokens of context
|
|
|
## Usage

### With MLX-LM

```python
from mlx_lm import load, generate

# Download (if needed) and load the quantized weights and tokenizer.
model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
```
|
|
|
### With LM Studio

1. Download the model files
2. Load the model in LM Studio
3. Set the context length according to your available memory
4. Apply the generation settings listed under "Recommended Generation Settings" above
|
|
|
## Limitations

- Requires substantial unified memory (512GB+ recommended)
- Optimized specifically for Apple Silicon; may not perform well on other architectures
- Quantization may introduce minor quality differences compared to the full-precision model
|
|
|
## Training Data & Bias

Please refer to the original [GLM-4.5 model card](https://huggingface.co/zai-org/GLM-4.5) for information about training data, intended use, and potential biases.
|
|
|
## Citation

If you use this model, please cite the original GLM-4.5 work and acknowledge this MLX conversion:

```bibtex
@misc{glm45-mlx-8bit,
  title={GLM-4.5 MLX 8-bit},
  author={Onceler},
  year={2025},
  howpublished={\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}}
}
```
|
|
|
## Acknowledgments

- Original model by ZhipuAI (zai-org/GLM-4.5)
- MLX framework by Apple
- Conversion performed on a Mac Studio with 512GB unified memory
|
|
|
## License

This model inherits the license of the original GLM-4.5 model. Please refer to the original model repository for license details.