Tags: Transformers · English · electra · pretraining · Inference Endpoints

TensorFlow Model Garden LMs: FineWeb WordPiece Tokenizer

This WordPiece tokenizer was trained as part of the TensorFlow Model Garden LMs project.

The tokenizer was trained on the sample-10BT subsets of the FineWeb and FineWeb-Edu datasets, with a vocabulary size of 64,000 subtokens.
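For reference, here is a minimal sketch of how a WordPiece tokenizer with these settings could be trained using the Hugging Face `tokenizers` library. This is not the project's actual training script: the normalization settings and special tokens are assumptions (BERT-style defaults, plausible given the ELECTRA tag), and the dataset names are the public Hub IDs for FineWeb and FineWeb-Edu.

```python
from itertools import chain

from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# WordPiece model with BERT-style normalization and pre-tokenization
# (assumed; the project's script may configure these differently).
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# 64,000 subtokens, matching the vocabulary size stated above.
trainer = trainers.WordPieceTrainer(
    vocab_size=64_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Stream the sample-10BT subsets of FineWeb and FineWeb-Edu.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                           split="train", streaming=True)

def texts():
    for example in chain(fineweb, fineweb_edu):
        yield example["text"]

tokenizer.train_from_iterator(texts(), trainer=trainer)
tokenizer.save("tokenizer.json")
```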

A script for training the tokenizer can be found here.
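Since the repository is tagged for the Transformers library, the tokenizer should be loadable directly from the Hub. A short usage example, assuming the repo ships a Transformers-compatible tokenizer config:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hub repository named below.
tokenizer = AutoTokenizer.from_pretrained("model-garden-lms/fineweb-lms-vocab-64000")

# Tokenize a sample sentence and inspect the resulting subtokens.
encoding = tokenizer("FineWeb is a large-scale web corpus.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```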

Datasets used to train model-garden-lms/fineweb-lms-vocab-64000: HuggingFaceFW/fineweb (sample-10BT) and HuggingFaceFW/fineweb-edu (sample-10BT).