---
license: mit
language:
- en
- zh
---

# Model Card for sparsing-law-0.1b-relu

- **Paper:** [paper](https://arxiv.org/pdf/2411.02335)
- **Repository and demo code:** [GitHub](https://github.com/thunlp/SparsingLaw)

This model is ReLU-activated and contains approximately 0.1 billion non-embedding parameters.
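
The model can be loaded like any causal language model on the Hugging Face Hub. The snippet below is a minimal sketch: the repository id is a placeholder (substitute the actual Hub id), and `trust_remote_code=True` is included on the assumption that the architecture ships custom modeling code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- replace with the actual Hub id of this model.
model_id = "path/to/sparsing-law-0.1b-relu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code may be required if the checkpoint uses custom modeling code.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Simple greedy generation as a smoke test.
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```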

The model was trained from scratch on the pre-training dataset described in our paper, using the WSD (Warmup-Stable-Decay) learning rate scheduler. It is the final checkpoint of the stable stage in WSD, meaning it has not undergone the decay stage.
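
For readers unfamiliar with WSD, the sketch below shows the common shape of such a schedule: a linear warmup, a long constant (stable) plateau, and a final anneal. The phase lengths, peak learning rate, and decay form are illustrative placeholders, not the values used in the paper; this checkpoint corresponds to the end of the stable plateau, before any decay is applied.

```python
def wsd_lr(step: int, warmup_steps: int, stable_steps: int,
           decay_steps: int, peak_lr: float, final_lr: float) -> float:
    """Illustrative WSD (Warmup-Stable-Decay) schedule; all values are placeholders."""
    if step < warmup_steps:
        # Warmup: linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(warmup_steps, 1)
    if step < warmup_steps + stable_steps:
        # Stable: constant at peak_lr (this model's training stops here).
        return peak_lr
    # Decay: linear anneal from peak_lr down to final_lr.
    progress = min((step - warmup_steps - stable_steps) / max(decay_steps, 1), 1.0)
    return peak_lr + (final_lr - peak_lr) * progress
```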