Tags: Transformers · English · electra · pretraining · Inference Endpoints

TensorFlow Model Garden LMs: FineWeb WordPiece Tokenizer

This WordPiece tokenizer was trained as part of the TensorFlow Model Garden LMs project.

The tokenizer was trained on the sample-10BT subsets of the FineWeb and FineWeb-Edu datasets, with a vocabulary size of 64,000 subtokens.
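For reference, here is a minimal sketch of how a WordPiece tokenizer with these settings could be trained using the Hugging Face `tokenizers` library. This is not the project's actual training script: the normalization settings and special tokens are assumptions (BERT-style defaults, plausible given the ELECTRA tag), and the dataset names are the public Hub IDs for FineWeb and FineWeb-Edu.

```python
from itertools import chain

from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# WordPiece model with BERT-style normalization and pre-tokenization
# (assumed; the project's script may configure these differently).
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# 64,000 subtokens, matching the vocabulary size stated above.
trainer = trainers.WordPieceTrainer(
    vocab_size=64_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Stream the sample-10BT subsets of FineWeb and FineWeb-Edu.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                           split="train", streaming=True)

def texts():
    for example in chain(fineweb, fineweb_edu):
        yield example["text"]

tokenizer.train_from_iterator(texts(), trainer=trainer)
tokenizer.save("tokenizer.json")
```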

A script for training the tokenizer can be found here.
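Since the repository is tagged for the Transformers library, the tokenizer should be loadable directly from the Hub. A short usage example, assuming the repo ships a Transformers-compatible tokenizer config:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hub repository named below.
tokenizer = AutoTokenizer.from_pretrained("model-garden-lms/fineweb-lms-vocab-64000")

# Tokenize a sample sentence and inspect the resulting subtokens.
encoding = tokenizer("FineWeb is a large-scale web corpus.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```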

Datasets used to train model-garden-lms/fineweb-lms-vocab-64000: HuggingFaceFW/fineweb (sample-10BT) and HuggingFaceFW/fineweb-edu (sample-10BT).