cbow.uk.300.bin contains pre-trained word vectors for the Ukrainian language, trained with fastText on the (as yet unreleased) UberText2.0 dataset, which was collected and processed by the lang-uk project. The model was trained with CBOW at dimension 300, a character n-gram range of 4-6, and 15 negative samples.
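The hyperparameters named above map onto the fastText Python training API roughly as follows. This is a minimal sketch, not the authors' training script: `train_vectors` and both of its path arguments are hypothetical, and every setting this card does not mention is left at the fastText default, which may differ from what was actually used.

```python
# Hyperparameters stated in this model card. Everything else is left at
# fastText defaults (an assumption: the card does not list other settings).
TRAIN_PARAMS = {
    "model": "cbow",  # CBOW architecture
    "dim": 300,       # vector dimension
    "minn": 4,        # shortest character n-gram
    "maxn": 6,        # longest character n-gram
    "neg": 15,        # number of negative samples
}


def train_vectors(corpus_path, output_path):
    """Train and save vectors; both paths are hypothetical placeholders."""
    import fasttext  # imported lazily so the config can be inspected without it

    model = fasttext.train_unsupervised(corpus_path, **TRAIN_PARAMS)
    model.save_model(output_path)
    return model
```

Keeping the settings in a plain dictionary makes it easy to compare them against other fastText configurations before starting a long training run.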
The dataset for Ukrainian word analogy is available here.
Extrinsic evaluations were performed on two sequence labeling tasks: NER and POS tagging. The NER-UK dataset was released by the lang-uk project, and the Ukrainian (UD) corpus was developed by the Institute for Ukrainian, a non-profit organization.
Results:
- Word analogy task: 0.49
- spaCy NER F-score: 0.82
- POS Flair Accuracy: 0.82
- POS spaCy Accuracy: 0.87
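For readers unfamiliar with the word analogy task, the sketch below shows one common way such a score can be computed: the vector-offset method (b - a + c) with top-1 accuracy. This card does not state the exact protocol, so treat the method as an assumption. The two-dimensional vectors and the single question are toy data invented for illustration, not values from this model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    if not questions:
        return 0.0
    correct = sum(1 for a, b, c, d in questions if solve_analogy(vectors, a, b, c) == d)
    return correct / len(questions)


# Toy 2-dimensional vectors, invented for this example only.
toy_vectors = {
    "чоловік": [0.2, 1.0],
    "жінка": [0.2, -1.0],
    "король": [1.0, 1.0],
    "королева": [1.0, -1.0],
    "яблуко": [-1.0, 0.1],
}
toy_questions = [("чоловік", "жінка", "король", "королева")]

accuracy = analogy_accuracy(toy_vectors, toy_questions)
print(accuracy)
```

With real vectors, `vectors` would be built from the model's vocabulary, and the candidate search would normally be restricted to the most frequent words to keep it tractable.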
Usage
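import fasttext

# Load the binary model (the file must be in the working directory)
ft = fasttext.load_model('cbow.uk.300.bin')
# Look up the 300-dimensional vector for a word
ft.get_word_vector('привіт')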
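Beyond single-vector lookup, a loaded model can be sanity-checked and queried for similar words. The helpers below are a small sketch: `check_dimension` and `neighbours` are hypothetical names, while `get_dimension` and `get_nearest_neighbors` are methods of the fastText Python model object.

```python
def check_dimension(model, expected=300):
    """Raise if the loaded model does not have the dimension this card states."""
    actual = model.get_dimension()
    if actual != expected:
        raise ValueError(f"expected {expected}-dimensional vectors, got {actual}")
    return actual


def neighbours(model, word, k=5):
    """Return up to k (word, similarity) pairs, most similar first."""
    # get_nearest_neighbors returns (similarity, word) tuples; flip them for readability.
    return [(w, float(score)) for score, w in model.get_nearest_neighbors(word, k=k)]
```

With `ft` loaded as in the usage snippet, `check_dimension(ft)` confirms the file matches the description, and `neighbours(ft, 'привіт')` lists the closest words in the vocabulary.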
BibTeX entry and citation info
@inproceedings{romanyshyn-etal-2023-learning,
title = "Learning Word Embeddings for {U}krainian: A Comparative Study of Fast{T}ext Hyperparameters",
author = "Romanyshyn, Nataliia and
Chaplynskyi, Dmytro and
Zakharov, Kyrylo",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.3",
pages = "20--31",
}
Copyright: Dmytro Chaplynskyi, lang-uk project, Nataliia Romanyshyn, Ukrainian Catholic University, 2022