---
library_name: transformers
datasets:
- uonlp/CulturaX
- others
language:
- bg
license: mit
---

# Model Card for bert_bg_lit_web_extra_large_uncased

Uncased BERT model trained on Bulgarian literature, web, and other datasets.

## Model Details

A 657M-parameter BERT model trained on 19B tokens (23B, depending on tokenization) for 5 epochs with a Masked Language Modelling objective. Its main dimensions, which can also be read from the published config (see the sketch at the end of this section), are:
- Tokenizer vocabulary size: 50176.
- Hidden dimension: 1024.
- Feed-forward dimension: 4096.
- Number of hidden layers: 48.

- **Developed by:** Artificial Intelligence and Language Technologies Department at [Institute of Information and Communication Technologies](https://www.iict.bas.bg/en/index.html) - Bulgarian Academy of Sciences.
- **Funded by:** The model is pretrained within the [CLaDA-BG: National Interdisciplinary Research E-Infrastructure for
Bulgarian Language and Cultural heritage - member of the pan-European research consortia CLARIN-ERIC & DARIAH-ERIC](https://clada-bg.eu/en/),
funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure).
The training was performed at the supercomputer [HEMUS](http://ict.acad.bg/?page_id=1659) at IICT-BAS, part of the RIs of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020), and co-financed by the European Union through the ESIF.
- **Model type:** BERT
- **Language(s) (NLP):** Bulgarian.
- **License:** MIT
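
The architectural values above can be checked against the configuration shipped with the checkpoint. A minimal sketch (the attribute names are the standard `transformers` BERT config fields):

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint.
config = AutoConfig.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')

# These fields should match the numbers listed above.
print(config.vocab_size)         # expected: 50176
print(config.hidden_size)        # expected: 1024
print(config.intermediate_size)  # expected: 4096
print(config.num_hidden_layers)  # expected: 48
```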

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model is intended to be used as a base model for fine-tuning on downstream NLP tasks.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
```python
>>> from transformers import (
...     PreTrainedTokenizerFast,
...     BertForMaskedLM,
...     pipeline,
... )

>>> model = BertForMaskedLM.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')

>>> fill_mask = pipeline(
...     "fill-mask",
...     model=model,
...     tokenizer=tokenizer,
... )

>>> fill_mask("Заради 3 завода няма да [MASK] нито есенниците неподхранени, нито зърното да поскъпне заради тях.")

[{'score': 0.3125033974647522,
  'token': 19273,
  'token_str': 'останат',
  'sequence': 'заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.24685998260974884,
  'token': 15953,
  'token_str': 'остави',
  'sequence': 'заради 3 завода няма да остави нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.2007855325937271,
  'token': 27509,
  'token_str': 'оставим',
  'sequence': 'заради 3 завода няма да оставим нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.10415761172771454,
  'token': 24533,
  'token_str': 'оставят',
  'sequence': 'заради 3 завода няма да оставят нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.016328580677509308,
  'token': 13989,
  'token_str': 'има',
  'sequence': 'заради 3 завода няма да има нито есенниците неподхранени, нито зърното да поскъпне заради тях.'}]
```

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is not trained with the Next Sentence Prediction objective, so the [CLS] token embedding will not be useful out of the box.
If you want to use the model for sequence classification, it is recommended to fine-tune it first, for example along the lines of the sketch below.
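
A minimal fine-tuning sketch with the `transformers` `Trainer`; the toy dataset, label count, and hyper-parameters below are placeholders for illustration, not part of the release:

```python
from datasets import Dataset
from transformers import (
    PreTrainedTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_id = 'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased'
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
# num_labels=2 is a placeholder; use the number of classes in your task.
model = BertForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative dataset; replace with your own labelled Bulgarian data.
ds = Dataset.from_dict({
    "text": ["Този филм беше чудесен.", "Този филм беше ужасен."],
    "label": [1, 0],
})
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_bg_cls", num_train_epochs=3, per_device_train_batch_size=2),
    train_dataset=ds,
)
trainer.train()
```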

### Recommendations

It is recommended to use the model as a base for token classification and sequence classification fine-tuning tasks.
The model can also be used within the SentenceTransformers framework for producing sentence embeddings, as sketched below.
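
A minimal sketch of wrapping the checkpoint in SentenceTransformers. Mean pooling is an assumption here (chosen because the [CLS] embedding is not trained), not something prescribed by the release:

```python
from sentence_transformers import SentenceTransformer, models

# Load the BERT checkpoint as a word-embedding module.
word_embedding = models.Transformer('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased', max_seq_length=512)

# Mean pooling over token embeddings (the [CLS] vector is not trained for this).
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")

model = SentenceTransformer(modules=[word_embedding, pooling])

embeddings = model.encode(["Примерно изречение на български.", "Още едно изречение."])
print(embeddings.shape)  # (2, 1024)
```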


## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

Trained on 19B tokens, mainly consisting of:
- uonlp/CulturaX
- [MaCoCu-bg 2.0](https://www.clarin.si/repository/xmlui/handle/11356/1800)
- Literature
- others

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Trained with the Masked Language Modelling objective (20% of tokens masked) for 5 epochs, using tf32 mixed precision, a 512-token context window, and a batch size of 256 sequences of 512 tokens each.
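
For reference, a sketch of a masking and precision setup matching these numbers; this is an illustration of the stated hyper-parameters, not the authors' actual training script:

```python
import torch
from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling

# tf32 matmuls, as used for the pretraining runs (requires Ampere or newer GPUs).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')

# 20% of tokens are selected for the MLM objective (the library default is 15%).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,
)

# Effective batch: 256 sequences of 512 tokens each, i.e. 131072 tokens per step.
```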


## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
The model is evaluated on the Masked Language Modelling objective on a held-out test split with 20% of tokens randomly masked.
It achieves a test loss of 1.09 and a test accuracy of 75.55% on the masked tokens.
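
A sketch of how such an evaluation can be computed for a batch of held-out texts. The test split itself is not distributed with the model, so the text below is a placeholder, and the masking is random, so per-sentence numbers will fluctuate:

```python
import torch
from transformers import (
    PreTrainedTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
)

model_id = 'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased'
model = BertForMaskedLM.from_pretrained(model_id)
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)

# Placeholder text; substitute your own held-out Bulgarian data.
texts = ["Заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне."]
batch = collator([tokenizer(t, truncation=True, max_length=512) for t in texts])

model.eval()
with torch.no_grad():
    out = model(**batch)

mask = batch["labels"] != -100          # positions that were masked by the collator
preds = out.logits.argmax(dim=-1)
accuracy = (preds[mask] == batch["labels"][mask]).float().mean()
print(f"loss={out.loss.item():.3f}  masked-token accuracy={accuracy.item():.2%}")
```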


## Model Card Authors
Nikolay Paev, Kiril Simov

## Model Card Contact
[email protected]