fastText CBOW on DCLM-400

A continuous-bag-of-words (CBOW) model trained on https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x

The CBOW model was trained with https://github.com/facebookresearch/fastText/

The dataset was downloaded with git-lfs.

The dataset commit was f20ae752116ce7b4ab15d31e1e40b094229bf911.
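A minimal sketch of the download step, assuming the repo was cloned under /root/lfs (the path that appears in the decompression command below):

# clone at the recorded commit; git-lfs fetches the .jsonl.zst shards
git lfs install
cd /root/lfs
git clone https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x
cd dclm-pool-400m-1x
git checkout f20ae752116ce7b4ab15d31e1e40b094229bf911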

The files were decompressed with:

parallel "zstd --keep --stdout -d {} | jq .text > {/}.txt" ::: /root/lfs/dclm-pool-400m-1x/*.jsonl.zst

The per-shard text files were then concatenated with:

cat *.txt > CC_SHARD_ALL.jsonl.txt

The sha256sum of CC_SHARD_ALL.jsonl.txt is:

576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640 CC_SHARD_ALL.jsonl.txt
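A local copy can be checked against this digest with:

echo "576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640  CC_SHARD_ALL.jsonl.txt" | sha256sum --check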

The fastText model was then trained with default settings, using fastText built from the main branch at commit 1142dc4c4ecbc19cc16eee5cdd28472e689267e6 and compiled with gcc 13.3.1.
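A minimal sketch of the build, assuming a plain make build of the pinned commit:

git clone https://github.com/facebookresearch/fastText/
cd fastText
git checkout 1142dc4c4ecbc19cc16eee5cdd28472e689267e6
# make places the fasttext binary in the repo root
make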

Training command:

prlimit -m 3200000000 fasttext cbow -input CC_SHARD_ALL.jsonl.txt -output fasttext_models/model
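With the defaults written out, the command above is roughly equivalent to the following (hyperparameter values taken from the defaults documented in the fastText README; worth re-checking against the pinned commit):

prlimit -m 3200000000 fasttext cbow -input CC_SHARD_ALL.jsonl.txt -output fasttext_models/model -dim 100 -ws 5 -epoch 5 -minCount 5 -neg 5 -loss ns -minn 3 -maxn 6 -lr 0.05

prlimit -m (--rss) asks the kernel to cap the resident set size of the training process at about 3.2 GB.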

The exact fasttext binary used is included in this repo as fasttext.
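The same binary can query the trained model; a short usage sketch (fasttext_models/model.bin is the file written under the -output prefix above):

# interactive nearest-neighbor queries against the trained vectors
./fasttext nn fasttext_models/model.bin

# print the vector for each word read from stdin
echo "example" | ./fasttext print-word-vectors fasttext_models/model.bin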

The decompression and concatenation took a few hours.

Model training took 100 hours on 8 cores, plus a few hours for fastText to read in the words.
