Model Card for bert_bg_lit_web_extra_large_uncased

Uncased BERT model trained on Bulgarian literature, web text, and other datasets.

Model Details

A 657M-parameter BERT model trained on 19B tokens (23B depending on tokenization) for 5 epochs with the Masked Language Modelling objective.

Uses

The model is intended to be used as a base model for fine-tuning tasks in NLP.

Direct Use

>>> from transformers import (
...     PreTrainedTokenizerFast,
...     BertForMaskedLM,
...     pipeline
... )

>>> model = BertForMaskedLM.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_extra_large_uncased')

>>> fill_mask = pipeline(
...     "fill-mask",
...     model=model,
...     tokenizer=tokenizer
... )

>>> # Example sentence, roughly: "Because of the 3 plants, the winter crops will neither [MASK] unfertilized, nor will the grain become more expensive because of them."
>>> fill_mask("Заради 3 завода няма да [MASK] нито есенниците неподхранени, нито зърното да поскъпне заради тях.")

[{'score': 0.3125033974647522,
  'token': 19273,
  'token_str': 'останат',
  'sequence': 'заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.24685998260974884,
  'token': 15953,
  'token_str': 'остави',
  'sequence': 'заради 3 завода няма да остави нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.2007855325937271,
  'token': 27509,
  'token_str': 'оставим',
  'sequence': 'заради 3 завода няма да оставим нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.10415761172771454,
  'token': 24533,
  'token_str': 'оставят',
  'sequence': 'заради 3 завода няма да оставят нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.016328580677509308,
  'token': 13989,
  'token_str': 'има',
  'sequence': 'заради 3 завода няма да има нито есенниците неподхранени, нито зърното да поскъпне заради тях.'}]

Out-of-Scope Use

The model is not trained with the Next Sentence Prediction objective, so the [CLS] token embedding will not be useful out of the box. If you want to use the model for sequence classification, it is recommended to fine-tune it first.
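
As a minimal fine-tuning sketch (the dataset variables, label count, and training arguments below are illustrative placeholders, not the authors' setup):

from transformers import (
    BertForSequenceClassification,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the pretrained encoder with a freshly initialised classification head;
# num_labels=2 is an illustrative value for a binary task.
model = BertForSequenceClassification.from_pretrained(
    'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased', num_labels=2
)
# Use the model's own tokenizer to prepare your datasets beforehand.
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased'
)

# `train_dataset` and `eval_dataset` are placeholders for your own tokenized
# datasets containing `input_ids`, `attention_mask` and `labels`.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='bert_bg_cls', num_train_epochs=3),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()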

Recommendations

It is recommended to use the model as a base for token classification and sequence classification fine-tuning tasks. The model can also be used within the SentenceTransformers framework for producing embeddings.
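
For example, a sketch of wrapping the model in SentenceTransformers (the mean-pooling choice and the example sentences are assumptions, not specified by this card):

from sentence_transformers import SentenceTransformer, models

# Wrap the BERT encoder with a pooling layer; mean pooling is an assumption.
word_embedding_model = models.Transformer(
    'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased', max_seq_length=512
)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean'
)
embedder = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Encode a couple of lowercased Bulgarian sentences into dense vectors.
embeddings = embedder.encode([
    'зърното няма да поскъпне.',   # "the grain will not become more expensive."
    'цените остават стабилни.',    # "prices remain stable."
])
print(embeddings.shape)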

Training Details

Training Data

Trained on 19B tokens, consisting mainly of Bulgarian literature, web text, and other Bulgarian datasets.

Training Procedure

Trained with the Masked Language Modelling objective using a 20% masking rate, for 5 epochs, with tf32 mixed precision, a 512-token context, and a batch size of 256*512 tokens.
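
A rough sketch of how this masking setup maps onto the Hugging Face Trainer API for continued MLM pretraining (the dataset, output directory, and per-device batch size are illustrative placeholders; this is not the authors' training code):

from transformers import (
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased'
)
model = BertForMaskedLM.from_pretrained(
    'AIaLT-IICT/bert_bg_lit_web_extra_large_uncased'
)

# Dynamic masking with the 20% rate stated above.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)

# `train_dataset` is a placeholder for a dataset of 512-token sequences;
# the batch size and output directory are illustrative values only.
training_args = TrainingArguments(
    output_dir='bert_bg_mlm',
    per_device_train_batch_size=256,
    num_train_epochs=5,
    tf32=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()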

Evaluation

The model is evaluated on the Masked Language Modelling objective on the test split with 20% of tokens randomly masked. It achieves a test loss of 1.09 and a test accuracy of 75.55%.
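
For reference, masked-token accuracy can be computed from model outputs roughly as follows (a sketch, assuming the usual convention that non-masked positions are labelled -100):

import torch

# `logits` and `labels` are assumed to come from a BertForMaskedLM batch.
def masked_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    predictions = logits.argmax(dim=-1)
    mask = labels != -100                      # score only the masked positions
    correct = (predictions[mask] == labels[mask]).sum().item()
    return correct / mask.sum().item()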

Model Card Authors

Nikolay Paev, Kiril Simov

Model Card Contact

[email protected]
