---
library_name: transformers
datasets:
- uonlp/CulturaX
- others
language:
- bg
license: mit
---

# Model Card for bert_bg_lit_web_extra_large_uncased

Uncased BERT model trained on Bulgarian literature, web, and other datasets.

## Model Details

A 657M-parameter BERT model trained on 19B tokens (23B, depending on tokenization) for 5 epochs with a Masked Language Modelling objective. Its main dimensions, which can also be read from the published config (see the sketch at the end of this section), are:
- Tokenizer vocabulary size: 50176.
- Hidden dimension: 1024.
- Feed-forward dimension: 4096.
- Number of hidden layers: 48.

- **Developed by:** Artificial Intelligence and Language Technologies Department at [Institute of Information and Communication Technologies](https://www.iict.bas.bg/en/index.html) - Bulgarian Academy of Sciences.
- **Funded by:** The model is pretrained within the [CLaDA-BG: National Interdisciplinary Research E-Infrastructure for
Bulgarian Language and Cultural heritage - member of the pan-European research consortia CLARIN-ERIC & DARIAH-ERIC](https://clada-bg.eu/en/),
funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure).
The training was performed at the supercomputer [HEMUS](http://ict.acad.bg/?page_id=1659) at IICT-BAS, part of the RIs of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020), and co-financed by the European Union through the ESIF.
- **Model type:** BERT
- **Language(s) (NLP):** Bulgarian.
- **License:** MIT
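
The architectural values above can be checked against the configuration shipped with the checkpoint. A minimal sketch (the attribute names are the standard `transformers` BERT config fields):

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint.
config = AutoConfig.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')

# These fields should match the numbers listed above.
print(config.vocab_size)         # expected: 50176
print(config.hidden_size)        # expected: 1024
print(config.intermediate_size)  # expected: 4096
print(config.num_hidden_layers)  # expected: 48
```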

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model is intended to be used as a base model for fine-tuning on downstream NLP tasks.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
```python
>>> from transformers import (
...     PreTrainedTokenizerFast,
...     BertForMaskedLM,
...     pipeline,
... )

>>> model = BertForMaskedLM.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')

>>> fill_mask = pipeline(
...     "fill-mask",
...     model=model,
...     tokenizer=tokenizer,
... )

>>> fill_mask("Заради 3 завода няма да [MASK] нито есенниците неподхранени, нито зърното да поскъпне заради тях.")

[{'score': 0.3125033974647522,
  'token': 19273,
  'token_str': 'останат',
  'sequence': 'заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.24685998260974884,
  'token': 15953,
  'token_str': 'остави',
  'sequence': 'заради 3 завода няма да остави нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.2007855325937271,
  'token': 27509,
  'token_str': 'оставим',
  'sequence': 'заради 3 завода няма да оставим нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.10415761172771454,
  'token': 24533,
  'token_str': 'оставят',
  'sequence': 'заради 3 завода няма да оставят нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.016328580677509308,
  'token': 13989,
  'token_str': 'има',
  'sequence': 'заради 3 завода няма да има нито есенниците неподхранени, нито зърното да поскъпне заради тях.'}]
```

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is not trained with the Next Sentence Prediction objective, so the [CLS] token embedding will not be useful out of the box.
If you want to use the model for sequence classification, it is recommended to fine-tune it first, for example along the lines of the sketch below.
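
A minimal fine-tuning sketch with the `transformers` `Trainer`; the toy dataset, label count, and hyper-parameters below are placeholders for illustration, not part of the release:

```python
from datasets import Dataset
from transformers import (
    PreTrainedTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_id = 'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased'
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
# num_labels=2 is a placeholder; use the number of classes in your task.
model = BertForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative dataset; replace with your own labelled Bulgarian data.
ds = Dataset.from_dict({
    "text": ["Този филм беше чудесен.", "Този филм беше ужасен."],
    "label": [1, 0],
})
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_bg_cls", num_train_epochs=3, per_device_train_batch_size=2),
    train_dataset=ds,
)
trainer.train()
```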

### Recommendations

It is recommended to use the model as a base for token classification and sequence classification fine-tuning tasks.
The model can also be used within the SentenceTransformers framework for producing sentence embeddings, as sketched below.
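
A minimal sketch of wrapping the checkpoint in SentenceTransformers. Mean pooling is an assumption here (chosen because the [CLS] embedding is not trained), not something prescribed by the release:

```python
from sentence_transformers import SentenceTransformer, models

# Load the BERT checkpoint as a word-embedding module.
word_embedding = models.Transformer('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased', max_seq_length=512)

# Mean pooling over token embeddings (the [CLS] vector is not trained for this).
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")

model = SentenceTransformer(modules=[word_embedding, pooling])

embeddings = model.encode(["Примерно изречение на български.", "Още едно изречение."])
print(embeddings.shape)  # (2, 1024)
```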


## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

Trained on 19B tokens, mainly consisting of:
- uonlp/CulturaX
- [MaCoCu-bg 2.0](https://www.clarin.si/repository/xmlui/handle/11356/1800)
- Literature
- others

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Trained with the Masked Language Modelling objective (20% of tokens masked) for 5 epochs, using tf32 mixed precision, a 512-token context window, and a batch size of 256 sequences of 512 tokens each.
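
For reference, a sketch of a masking and precision setup matching these numbers; this is an illustration of the stated hyper-parameters, not the authors' actual training script:

```python
import torch
from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling

# tf32 matmuls, as used for the pretraining runs (requires Ampere or newer GPUs).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')

# 20% of tokens are selected for the MLM objective (the library default is 15%).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,
)

# Effective batch: 256 sequences of 512 tokens each, i.e. 131072 tokens per step.
```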


## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
The model is evaluated on the Masked Language Modelling objective on a held-out test split with 20% of tokens randomly masked.
It achieves a test loss of 1.09 and a test accuracy of 75.55% on the masked tokens.
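
A sketch of how such an evaluation can be computed for a batch of held-out texts. The test split itself is not distributed with the model, so the text below is a placeholder, and the masking is random, so per-sentence numbers will fluctuate:

```python
import torch
from transformers import (
    PreTrainedTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
)

model_id = 'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased'
model = BertForMaskedLM.from_pretrained(model_id)
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)

# Placeholder text; substitute your own held-out Bulgarian data.
texts = ["Заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне."]
batch = collator([tokenizer(t, truncation=True, max_length=512) for t in texts])

model.eval()
with torch.no_grad():
    out = model(**batch)

mask = batch["labels"] != -100          # positions that were masked by the collator
preds = out.logits.argmax(dim=-1)
accuracy = (preds[mask] == batch["labels"][mask]).float().mean()
print(f"loss={out.loss.item():.3f}  masked-token accuracy={accuracy.item():.2%}")
```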


## Model Card Authors
Nikolay Paev, Kiril Simov

## Model Card Contact
[email protected]