---
license: apache-2.0
tags:
- moe
- llm
- efficient-inference
pipeline_tag: text-generation
---

# TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice

## Model Description

TC-MoE is a novel Mixture-of-Experts (MoE) architecture that enhances traditional MoE models through expert space expansion: each original expert is expanded with the ternary set {-1, 0, +1} into parameter-sharing variants for the router to choose among. TC-MoE achieves:

- **9% reduction** in activated experts compared to Top-K routing
- **1.1% average performance gain** on language understanding benchmarks
- Flexible efficiency-effectiveness trade-off via a reward mechanism

Key innovations:

- 🎯 **Ternary Expert Expansion**: Creates parameter-sharing expert variants (-1, 0, +1) without significant computational overhead (see the sketch after this list)
- ⚖️ **Adaptive Load Balancing**: A new load-balance loss that keeps expert workloads evenly distributed
- 🎮 **Reward-Driven Routing**: Dynamic control of expert activation ratios
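
For intuition, the snippet below is a minimal, self-contained sketch of ternary expert choice routing. It is an illustration under assumptions, not the released implementation: the module name `TernaryChoiceRouter`, the linear gate, and the top-K selection details are hypothetical. What it demonstrates is the idea above: each of the `num_experts` original experts is mirrored by three parameter-sharing candidates scaled by -1, 0, and +1, the router scores all `3 * num_experts` candidates, and tokens routed to a 0-scaled candidate skip that expert's FFN entirely, which is what lowers the average number of activated experts.

```python
# Illustrative PyTorch sketch of ternary expert choice (hypothetical names,
# not the TC-MoE release).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TernaryChoiceRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # One routing logit per (expert, scale) pair -> 3 * num_experts candidates.
        self.gate = nn.Linear(hidden_size, 3 * num_experts, bias=False)
        # Ternary scales shared by every expert; the 0 variant is a free "skip".
        self.register_buffer("scales", torch.tensor([-1.0, 0.0, 1.0]))

    def forward(self, x: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
        # x: (num_tokens, hidden_size); experts: the original expert FFNs.
        probs = F.softmax(self.gate(x), dim=-1)          # (tokens, 3 * E)
        weights, idx = probs.topk(self.top_k, dim=-1)    # keep the top-K candidates
        expert_idx = idx // 3                            # which original expert
        scale = self.scales[idx % 3]                     # -1, 0, or +1 per choice
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.num_experts):
                # Tokens whose k-th choice is a non-zero variant of expert e;
                # 0-scaled choices contribute nothing and cost no expert compute.
                mask = (expert_idx[:, k] == e) & (scale[:, k] != 0)
                if mask.any():
                    coef = (weights[mask, k] * scale[mask, k]).unsqueeze(-1)
                    out[mask] += coef * experts[e](x[mask])
        return out
```

Because the -1, 0, and +1 candidates of one expert reuse that expert's weights, expanding the choice space adds only extra gate logits rather than extra expert parameters.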

## Model Overview

- **Architecture**: Decoder-only transformer based on LLaMA
- **Pretraining Data**: RedPajama (100B tokens)
- **Model Sizes**: 681M / 2.3B parameters

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the TC-MoE checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("stiger1000/TC-MoE")
tokenizer = AutoTokenizer.from_pretrained("stiger1000/TC-MoE")

# Encode a prompt and generate a continuation of up to 50 tokens (prompt included).
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```

## Training Details

- **Optimizer**: AdamW (β₁=0.9, β₂=0.95)
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: 4M tokens
- **Loss Components** (combined as sketched below):
  - Language Modeling Loss
  - Load Balance Loss (α₁=0.01)
  - Reward Loss (α₂=0.0)
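
For concreteness, the sketch below shows how these pieces could be assembled. The optimizer and scheduler calls are standard PyTorch/`transformers` APIs matching the hyperparameters listed above, while the loss-term variables, the warmup length, and the reuse of the `model` object from the Usage section are assumptions rather than the actual training code.

```python
# Hypothetical assembly of the training setup described above.
import torch
from transformers import get_cosine_schedule_with_warmup

ALPHA_1 = 0.01  # load balance loss weight
ALPHA_2 = 0.0   # reward loss weight

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,      # placeholder assumption
    num_training_steps=25_000,  # 100B tokens / 4M tokens per batch
)


def total_loss(lm_loss, load_balance_loss, reward_loss):
    # Language modeling loss plus the weighted auxiliary MoE losses.
    return lm_loss + ALPHA_1 * load_balance_loss + ALPHA_2 * reward_loss
```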

## Citation

```bibtex
@inproceedings{yan2025tcmoe,
  title={TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice},
  author={Yan, Shen and Bin, Xingyan and Zhang, Sijun and Wang, Yisen and Lin, Zhouchen},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```

📚 **Repository**: [GitHub](https://github.com/stiger1000/TC-MoE)