---
base_model:
- THUDM/GLM-4-32B-Base-0414
license: mit
pipeline_tag: text-generation
library_name: transformers
language:
- zh
- en
---
# GLM-4-32B-Base-32K
GLM-4-32B-Base-32K is an enhanced version of [THUDM's GLM-4-32B-Base-0414](https://huggingface.co/THUDM/GLM-4-32B-Base-0414), specifically engineered to offer robust performance over an extended context window. While the original model's capabilities degraded after 8,192 tokens, this version maintains strong performance up to a 32,000-token context, making it ideal for tasks requiring long-context understanding and processing.
This model was developed as a proof of concept to validate that a merging-centric approach to context extension can be applied successfully to larger-scale models. The techniques employed yielded an approximately 5% overall improvement on standard base-model benchmarks while substantially improving recall at the full 32K context length.
More details can be found in our [blog post](https://www.arcee.ai/blog/extending-afm-4-5b-to-64k-context-length), where we applied this work to our upcoming AFM 4.5B model.
## Model Details
- Architecture Base: [THUDM/GLM-4-32B-Base-0414](https://huggingface.co/THUDM/GLM-4-32B-Base-0414)
- Parameter Count: 32B
- License: [MIT](https://huggingface.co/arcee-ai/GLM-4-32B-Base-32K#license)
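
As a quick reference, the model loads like any other `transformers` causal LM. The snippet below is a minimal sketch; the repository ID comes from this card, while the generation settings and prompt are illustrative rather than a tuned recipe.

```python
# Minimal loading/generation sketch; settings are illustrative, not a tuned recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/GLM-4-32B-Base-32K"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard across available GPUs (requires accelerate)
)

# This is a base model, so use plain completion rather than a chat template.
prompt = "GLM-4-32B-Base-32K is a long-context base model that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```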
## Improvements
The primary improvement in this model is its enhanced long-context capability. The following methods were used to achieve this:
- Targeted Long-Context Training: The model underwent continued pretraining on sequences up to its full 32,000-token context length.
- Iterative Merging: Various model checkpoints were iteratively merged to combine the benefits of different training runs, enhancing both long-context and short-context performance.
- Short-Context Distillation: Knowledge from the original high-performing short-context model was distilled into the long-context-trained model to recover and retain its initial capabilities on shorter tasks.
As a result, where the original model's performance on the Needle in a Haystack (NIAH) benchmark declined beyond 8,192 tokens, this extended version maintains reliable performance across the entire 32,000-token context window.
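
To make the merging step concrete, here is a toy sketch of a single linear merge between a long-context checkpoint and the original short-context model. The paths, blend weight, and the use of plain linear interpolation are illustrative assumptions, not the exact iterative recipe used for this release.

```python
# Toy linear weight merge between two checkpoints (illustrative only; the
# actual release used an iterative merging process that may differ).
import torch
from transformers import AutoModelForCausalLM

long_ctx_path = "path/to/long-context-checkpoint"    # placeholder
short_ctx_path = "path/to/short-context-checkpoint"  # placeholder
alpha = 0.5                                          # placeholder blend weight

long_model = AutoModelForCausalLM.from_pretrained(long_ctx_path, torch_dtype=torch.float32)
short_model = AutoModelForCausalLM.from_pretrained(short_ctx_path, torch_dtype=torch.float32)

short_state = short_model.state_dict()
merged_state = {}
for name, long_param in long_model.state_dict().items():
    # Weighted average of the corresponding tensors from the two checkpoints.
    merged_state[name] = alpha * long_param + (1.0 - alpha) * short_state[name]

long_model.load_state_dict(merged_state)
long_model.save_pretrained("merged-checkpoint")
```

In practice, dedicated tooling such as mergekit supports more sophisticated merge methods than this plain interpolation.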
## Benchmarks
| Benchmark | GLM-4-32B-Base-0414 | GLM-4-32B-Base-32K |
|-----------|--------------------:|-----------------:|
| arc_challenge | 59.39% | **64.93%** |
| arc_easy | 85.44% | **87.88%** |
| hellaswag | 64.75% | **65.40%** |
| mmlu | 77.05% | **77.87%** |
| piqa | 81.61% | **83.19%** |
| truthfulqa_mc2 | 49.27% | **50.07%** |
| winogrande | 78.69% | **80.03%** |
### NIAH Benchmark Results Comparison
Columns are context lengths in tokens; values are needle-retrieval scores.
| Model | Task | 4,096 | 8,192 | 16,384 | 24,576 | 32,768 |
|-------|------|------:|------:|-------:|-------:|-------:|
| **GLM-4-32B-Base-0414** | | | | | | |
| | niah_single_1 | 100.0% | 100.0% | 77.0% | 5.2% | 1.2% |
| | niah_single_2 | 100.0% | 100.0% | 73.4% | 2.6% | 0.0% |
| | niah_single_3 | 100.0% | **99.8%** | 48.0% | 1.4% | 0.0% |
| **GLM-4-32B-Base-32K** | | | | | | |
| | niah_single_1 | 100.0% | 100.0% | **100.0%** | **99.2%** | **99.6%** |
| | niah_single_2 | 100.0% | 100.0% | **99.2%** | **80.2%** | **68.8%** |
| | niah_single_3 | 100.0% | 99.6% | **95.6%** | **86.6%** | **61.0%** |
### NIAH Averages
| Model | 4,096 | 8,192 | 16,384 | 24,576 | 32,768 |
|-------|------:|------:|-------:|-------:|-------:|
| GLM-4-32B-Base-0414 | 100.0% | 99.9% | 66.1% | 3.1% | 0.4% |
| GLM-4-32B-Base-32K | 100.0% | 99.9% | **98.3%** | **88.7%** | **76.5%** |
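
For readers unfamiliar with the benchmark, a needle-in-a-haystack probe hides a short fact at a chosen depth inside filler text of a target length and checks whether the model retrieves it. The sketch below is a simplified illustration; the filler text, needle, depth, and scoring are assumptions, not the harness behind the numbers above.

```python
# Simplified NIAH-style probe (illustrative; not the exact evaluation harness
# used for the tables above).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/GLM-4-32B-Base-32K"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

needle = "The magic number for this experiment is 48213."
question = "\nQuestion: What is the magic number for this experiment?\nAnswer:"
filler_sentence = "The sky was clear and the meeting ran long. "

target_tokens = 16_384  # probe at one of the context lengths from the table
sentence_tokens = len(tokenizer(filler_sentence)["input_ids"])
filler = filler_sentence * ((target_tokens - 64) // sentence_tokens)  # leave headroom

depth = 0.5  # bury the needle halfway into the haystack
cut = int(len(filler) * depth)
prompt = filler[:cut] + needle + " " + filler[cut:] + question

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print("retrieved" if "48213" in answer else "missed", "->", answer.strip())
```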
## Use Cases
This model is intended to serve as a new base for continued training at a 32K context length.
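
For that use case, one common preparation step is packing a corpus into fixed 32,768-token blocks. The sketch below shows one such packing strategy; the dataset name and text field are placeholders, and this is not a prescribed pipeline.

```python
# Sketch of packing long documents into 32,768-token blocks for continued
# pretraining (dataset name and text field are placeholders).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("arcee-ai/GLM-4-32B-Base-32K")
block_size = 32_768

dataset = load_dataset("your/long-document-corpus", split="train")  # placeholder

def tokenize(batch):
    return tokenizer(batch["text"])

def pack(batch):
    # Concatenate all token ids in the batch, then slice into full-length blocks.
    ids = [tok for seq in batch["input_ids"] for tok in seq]
    n_blocks = len(ids) // block_size
    blocks = [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
packed = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)
print(packed)
```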
## License
**GLM-4-32B-Base-32K (32B)** is released under the [MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md) license, in keeping with the original model's license.
If you have questions or would like to share your experiences using GLM-4-32B-Base-32K (32B), please connect with us on social media. We're excited to see what you build and how this model helps you innovate!