---
license: mit
library_name: mlx-lm
tags:
- mlx
- apple-silicon
- quantized
- moe
- text-generation
base_model: zai-org/GLM-4.5
model_type: glm
language:
- en
- zh
pipeline_tag: text-generation
---

# GLM-4.5 MLX 8-bit

## Model Description

This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), optimized for Apple Silicon systems with large unified memory configurations.

## Key Features

- **8-bit quantization** (8.502 bits per weight) for memory efficiency
- **MLX optimized** for Apple Silicon's unified memory architecture
- **High-memory optimized**: designed for systems with 512GB+ unified memory
- **Long-context capable**: tested with multiple 6,500+ word documents and 30K-token chunks
- **Performance**: ~11.75 tokens/second on a Mac Studio with 512GB RAM

## Model Details

- **Base Model**: GLM-4.5 by ZhipuAI
- **Architecture**: Mixture of Experts (MoE)
- **Quantization**: 8-bit MLX with group size 64 (see "Reproducing the Quantization" below)
- **MLX-LM Version**: 0.26.3
- **Model Size**: ~375GB
- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)

## System Requirements

- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (tested on M3 Ultra)
- **Memory**: 512GB+ unified memory strongly recommended
- **Storage**: ~400GB of free space
- **Software**: macOS with the MLX framework

## Performance Benchmarks

**Test configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory (see "Reproducing the Benchmarks" below)

### Context Length Performance

| Scenario | Total context | Input prompt | Speed | Peak memory |
|----------|---------------|--------------|-------|-------------|
| Short context | 6.5K tokens | – | 11.75 tok/s | – |
| Long context | 72K tokens | – | 5.0 tok/s | 86% |
| Extended context | 121K tokens | 30K tokens | 2.53 tok/s | 92% |
| Beyond stated limit | 132K tokens | 11K tokens | 5.74 tok/s | 85% |

- **Beyond the stated limit**: generation succeeded past the published 131K context window (102.2% of capacity)
- **Quality**: full comprehension and analysis of complex, sprawling content at maximum context length

### Recommended Generation Settings

- **Temperature**: 0.8
- **Top K**: 100
- **Repeat Penalty**: 1.1
- **Min P**: default/unset
- **Top P**: default/unset

A Python sketch that applies these settings is given under "Applying the Recommended Settings" below.

### Comparison with GGUF

- **MLX version**: the system remains responsive during inference, with stable performance
- **GGUF version**: the system becomes unusable, with frequent crashes around 30-40K tokens of context

## Usage

### With MLX-LM

```python
from mlx_lm import load, generate

# Loads the quantized weights into unified memory (downloads on first use)
model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

response = generate(model, tokenizer, "Your prompt here", max_tokens=500)
print(response)
```

### With LM Studio

1. Download the model files
2. Load the model in LM Studio
3. Set the context length according to your available memory
4. Apply the generation settings listed under "Recommended Generation Settings" above

## Limitations

- Requires substantial unified memory (512GB+ recommended)
- Optimized specifically for Apple Silicon; may not perform well on other architectures
- Quantization may introduce minor quality differences compared to the full-precision model

## Training Data & Bias

Please refer to the original [GLM-4.5 model card](https://huggingface.co/zai-org/GLM-4.5) for information about training data, intended use, and potential biases.
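## Reproducing the Quantization

The exact conversion command is not recorded in this card. As a minimal sketch, an equivalent 8-bit, group-size-64 quantization could be produced with mlx-lm's `convert` command; the flags below are from the mlx-lm 0.26.x CLI, and the output path is illustrative:

```bash
# Quantize the original weights to 8-bit MLX with group size 64.
# Assumes mlx-lm is installed (pip install mlx-lm) and substantial
# free disk space for both the original and converted weights.
mlx_lm.convert \
    --hf-path zai-org/GLM-4.5 \
    --mlx-path ./GLM-4.5-MLX-8bit \
    -q \
    --q-bits 8 \
    --q-group-size 64
```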
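## Reproducing the Benchmarks

The throughput and memory figures above can be reproduced informally: as of mlx-lm 0.26.x, passing `verbose=True` to `generate` prints prompt and generation tokens-per-second along with peak memory. The prompt file below is illustrative:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# A long document to exercise the context window; substitute your own.
with open("long_document.txt") as f:
    prompt = f.read()

# verbose=True prints prompt/generation tokens-per-second and peak memory.
generate(model, tokenizer, prompt, max_tokens=500, verbose=True)
```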
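## Applying the Recommended Settings

The sampler values under "Recommended Generation Settings" can be applied programmatically. This is a minimal sketch using the helpers in `mlx_lm.sample_utils` (names per mlx-lm 0.26.x); the example prompt is illustrative:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("mlx-community/GLM-4.5-MLX-8bit")

# Temperature 0.8 and top-k 100; top-p and min-p stay at their defaults.
sampler = make_sampler(temp=0.8, top_k=100)

# Repeat penalty of 1.1, applied over the recent context.
logits_processors = make_logits_processors(repetition_penalty=1.1)

# Chat-style prompting via the tokenizer's chat template.
messages = [{"role": "user", "content": "Summarize this document."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=500,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```

The temperature and top-k settings are also exposed as flags on the command-line generator (again assuming the 0.26.x flag names):

```bash
mlx_lm.generate --model mlx-community/GLM-4.5-MLX-8bit \
    --prompt "Your prompt here" --max-tokens 500 --temp 0.8 --top-k 100
```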
## Citation

If you use this model, please cite the original GLM-4.5 work and acknowledge this MLX conversion:

```bibtex
@misc{glm45-mlx-8bit,
  title={GLM-4.5 MLX 8-bit},
  author={Onceler},
  year={2025},
  howpublished={\url{https://huggingface.co/mlx-community/GLM-4.5-MLX-8bit}},
}
```

## Acknowledgments

- Original model by ZhipuAI ([zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5))
- MLX framework by Apple
- Conversion performed on a Mac Studio with 512GB unified memory

## License

This model inherits the MIT license of the original GLM-4.5 model. Please refer to the original model repository for license details.