Onceler committed
Commit d6442da · verified · 1 Parent(s): 24fd5bf

Update README.md

Files changed (1): README.md (+7 −7)

README.md CHANGED
@@ -26,7 +26,7 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.
 - **8-bit quantization** (8.502 bits per weight) for memory efficiency
 - **MLX optimized** for Apple Silicon unified memory architecture
 - **High-memory optimized**: Designed for systems with 512GB+ unified memory
-- **Long context capable**: Tested with 6,500+ word documents
+- **Long context capable**: Tested with multiple 6,500+ word documents, 30K token chunks
 - **Performance**: ~11.75 tokens/second on Mac Studio with 512GB RAM
 
 ## Model Details
@@ -36,24 +36,24 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.
 - **Quantization**: 8-bit MLX with group size 64
 - **MLX-LM Version**: 0.26.3
 - **Model Size**: ~375GB
-- **Context Length**: 131,072 tokens (tested stable up to 72K+ tokens)
+- **Context Length**: 131,072 tokens (tested stable up to 132K+ tokens)
 
 ## System Requirements
 
-- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (M1/M2/M3 series)
+- **Hardware**: Mac Studio or Mac Pro with Apple Silicon (M3 Ultra)
 - **Memory**: 512GB+ unified memory strongly recommended
 - **Storage**: ~400GB free space
 - **Software**: macOS with MLX framework
 
 ## Performance Benchmarks
 
-**Test Configuration**: Mac Studio with 512GB unified memory
+**Test Configuration**: 2025 Mac Studio M3 Ultra with 512GB unified memory
 
 ### Context Length Performance
 - **Short Context (6.5K tokens)**: 11.75 tokens/second
 - **Long Context (72K tokens)**: 5.0 tokens/second, 86% memory usage
-- **Extended Context (121K tokens)**: 2.53 tokens/second, 92% memory usage
-- **Beyond Theoretical Limit (132K tokens)**: 5.74 tokens/second, 85% peak memory
+- **Extended Context (121K tokens)**: 30K token input prompt, 2.53 tokens/second, 92% memory usage
+- **Beyond Theoretical Limit (132K tokens)**: 11K token input prompt, 5.74 tokens/second, 85% peak memory
 - **Proven Capability**: Successfully exceeds stated 131K context window (102.2% capacity)
 - **Quality**: Full comprehension and analysis of complex, sprawling content at maximum context
@@ -66,7 +66,7 @@ This is an 8-bit quantized MLX version of [zai-org/GLM-4.5](https://huggingface.
 
 ### Comparison with GGUF
 - **MLX Version**: System remains responsive during inference, stable performance
-- **GGUF Version**: System becomes unusable, frequent crashes around 30-40K tokens
+- **GGUF Version**: System becomes unusable, frequent crashes around 30-40K tokens in context window
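The ~375GB model size in the card is consistent with the stated 8.502 effective bits per weight. A minimal back-of-the-envelope sketch, assuming GLM-4.5 has roughly 355B total parameters (a figure not stated in this README):

```python
# Rough on-disk size estimate from effective bits per weight.
# ASSUMPTION: GLM-4.5 has ~355e9 total parameters (not stated in this README).
params = 355e9
bits_per_weight = 8.502  # effective bits/weight from the model card
size_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB
print(f"~{size_gb:.0f} GB")  # -> ~377 GB, close to the ~375GB figure
```

The small gap between ~377 and ~375 would be absorbed by the exact parameter count and by GB-vs-GiB conventions.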
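To put the benchmark throughput numbers in practical terms, generation wall-clock time scales inversely with tokens/second. A quick sketch using the rates from the table above (the 1,000-token response length is a hypothetical example, not from the card):

```python
# Estimated wall-clock time to generate a fixed-length response
# at each measured rate from the benchmark table.
rates_tps = {"6.5K ctx": 11.75, "72K ctx": 5.0, "121K ctx": 2.53, "132K ctx": 5.74}
response_tokens = 1000  # hypothetical response length
for ctx, tps in rates_tps.items():
    print(f"{ctx}: {response_tokens / tps:.0f} s")  # e.g. 6.5K ctx -> 85 s
```

This illustrates why the 121K-token run, at 2.53 tokens/second, feels several times slower in practice than the short-context runs.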
72