`allmalab/aLLMA-2-tokenizer` is a monolingual tokenizer for Azerbaijani, trained on the `azj_Latn` subset of the FineWeb-2 corpus.

Citation

BibTeX:

@inproceedings{isbarov-etal-2024-open,
    title = "Open foundation models for {A}zerbaijani language",
    author = "Isbarov, Jafar  and
      Huseynova, Kavsar  and
      Mammadov, Elvin  and
      Hajili, Mammad  and
      Ataman, Duygu",
    editor = {Ataman, Duygu  and
      Derin, Mehmet Oguz  and
      Ivanova, Sardana  and
      K{\"o}ksal, Abdullatif  and
      S{\"a}lev{\"a}, Jonne  and
      Zeyrek, Deniz},
    booktitle = "Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigturk-1.2/",
    pages = "18--28",
    abstract = "The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most of the production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systemic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) extensive evaluation that covers all major open-source models with Azerbaijani support."
}
