TUBELEX Statistical Language Models

This repository provides n-gram language models for the TUBELEX YouTube subtitle corpora. They are modified Kneser-Ney language models of order 5 (Heafield et al., 2013), i.e. KenLM models.

The files are in LZMA-compressed ARPA format.
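If your KenLM build cannot read `.xz` files directly, the model can be decompressed with Python's standard library first. The following is a minimal sketch; the helper name `decompress_xz` is hypothetical and not part of KenLM or TUBELEX.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None):
    """Decompress an .xz file and return the path of the output file.

    Hypothetical helper for illustration; not part of KenLM or TUBELEX.
    """
    src = Path(src)
    if dst is None:
        if src.suffix != '.xz':
            raise ValueError('pass dst explicitly for files without an .xz suffix')
        # Default output: same location, with the trailing ".xz" removed.
        dst = src.with_suffix('')
    dst = Path(dst)
    # Stream the data in chunks so that large models do not need to fit in memory.
    with lzma.open(src, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `decompress_xz('tubelex-en.arpa.xz', 'tubelex-en.arpa')` writes a plain ARPA file, whose path can then be passed to `kenlm.Model` as a string.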

What is TUBELEX?

TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.

@article{nohejl_etal_2024_film,
  title={Beyond {{Film Subtitles}}: {{Is YouTube}} the {{Best Approximation}} of {{Spoken Vocabulary}}?},
  author={Nohejl, Adam and Hudi, Frederikus and Kardinata, Eunike Andriani and Ozaki, Shintaro and Riera Machin, Maria Angelica and Sun, Hongyu and Vasselli, Justin and Watanabe, Taro},
  year={2024}, eprint={2410.03240}, archiveprefix={arXiv}, primaryclass={cs.CL},
  url={https://arxiv.org/abs/2410.03240v1}, journal={ArXiv preprint}, volume={arXiv:2410.03240v1 [cs]}
}

Usage

To download and use the KenLM models in Python, first install dependencies:

pip install huggingface_hub
pip install https://github.com/kpu/kenlm/archive/master.zip

You can then use e.g. the English (en) model in the following way:

import kenlm
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id='naist-nlp/tubelex-kenlm', filename='tubelex-en.arpa.xz')
# Loading the compressed model requires KenLM to be compiled with LZMA support (`HAVE_XZLIB`).
# Otherwise you will first need to decompress the model.
model = kenlm.Model(model_file)

text = 'a sequence of words'  # pre-tokenized, lower-cased, without punctuation
model.perplexity(text)
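
The model expects pre-tokenized, lower-cased input without punctuation. As a rough sketch for English only, raw text could be brought into that form as follows. The function `normalize_en` and its rules (whitespace tokenization, apostrophes kept inside words) are assumptions made for illustration, not the preprocessing used to build TUBELEX; languages written without spaces, such as Chinese and Japanese, would additionally need word segmentation.

```python
import re


def normalize_en(text):
    """Lower-case, replace punctuation with spaces, and collapse whitespace.

    Hypothetical approximation; the actual TUBELEX tokenization may differ.
    """
    text = text.lower()
    # Assumption: keep word characters, whitespace and apostrophes; drop the rest.
    text = re.sub(r"[^\w\s']", ' ', text)
    # Split on any whitespace and re-join with single spaces.
    return ' '.join(text.split())
```

The result can be passed to `model.perplexity` in place of `text` above.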