---
library_name: transformers
datasets:
- HuggingFaceTB/smollm-corpus
---
# Doge-tokenizer
Tokenizer for training models on [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), with support for R1-style reasoning fine-tuning.
This tokenizer was trained on 2M samples from:
- FineWeb-Edu 70%
- Cosmopedia v2 20%
- Python-Edu 5%
- FineMath 5%
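
As a rough illustration of the mixture above, the sketch below converts the listed percentages into per-source sample counts. It assumes that "2M" means exactly 2,000,000 samples and that the percentages apply to sample counts rather than tokens; neither is stated explicitly here. The names `TOTAL_SAMPLES`, `MIXTURE_PERCENT`, and `samples_per_source` are hypothetical and not part of any released training script.

```python
# Assumption: "2M samples" is taken as exactly 2,000,000.
TOTAL_SAMPLES = 2_000_000

# Mixture percentages as listed in this card (assumed to be by sample count).
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 20,
    "Python-Edu": 5,
    "FineMath": 5,
}


def samples_per_source(total, mixture_percent):
    """Split `total` samples across sources using integer percentages."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mixture_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIXTURE_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under these assumptions, FineWeb-Edu contributes 1,400,000 samples, Cosmopedia v2 400,000, and Python-Edu and FineMath 100,000 each.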