---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/cultura_ru_edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb
language:
- ru
- en
pipeline_tag: fill-mask
---
# RuModernBERT-base
The Russian version of the modernized bidirectional encoder-only Transformer model, [ModernBERT](https://arxiv.org/abs/2412.13663).
RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, English, and code data with a context length of up to 8,192 tokens, using data from the internet, books, scientific sources, and social media.
| | Model Size | Hidden Dim | Num Layers | Vocab Size | Context Length | Task |
|------------------------------------------------------------------------------:|:----------:|:----------:|:----------:|:----------:|:--------------:|:---------:|
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) | 35M | 384 | 12 | 50368 | 8192 | Masked LM |
| deepvk/RuModernBERT-base [this] | 150M | 768 | 22 | 50368 | 8192 | Masked LM |
## Usage
Don't forget to update `transformers` and install `flash-attn` if your GPU supports it.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Prepare model
model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
model = model.eval()
# Prepare input
text = "Лимончелло это настойка из [MASK]."
inputs = tokenizer(text, return_tensors="pt")
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
# Make prediction
outputs = model(**inputs)
# Show prediction
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: лимона
```
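If you only need top-k predictions, the same checkpoint also works with the `fill-mask` pipeline. A minimal sketch, reusing the example above ("Limoncello is a liqueur made from [MASK]"):

```python
from transformers import pipeline

# Fill-mask pipeline with the same checkpoint; flash attention is optional here.
fill_mask = pipeline("fill-mask", model="deepvk/RuModernBERT-base")

# "Limoncello is a liqueur made from [MASK]."
for prediction in fill_mask("Лимончелло это настойка из [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```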
## Training Details
This is the base version with 150 million parameters and the same configuration as in [`ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base).
The crucial difference lies in the data we used to pre-train this model.
### Tokenizer
We trained a new tokenizer following the original configuration.
We maintained the size of the vocabulary and added the same special tokens.
The tokenizer was trained on a mixture of Russian and English from FineWeb.
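As a quick sanity check, the released tokenizer can be inspected directly; a small sketch (the expected values follow the table above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-base")

# Vocabulary size should match the 50368 reported in the table above.
print(len(tokenizer))
# Special tokens mirror those of the original ModernBERT tokenizer.
print(tokenizer.special_tokens_map)
```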
### Dataset
Pre-training includes three main stages: massive pre-training, context extension, and cooldown.
Unlike the original model, we did not use the same data for all stages.
For the second and third stages, we used cleaner data sources.
| Data Source | Stage 1 | Stage 2 | Stage 3 |
|----------------------:|:--------:|:-------:|:--------:|
| FineWeb (En+Ru) | ✅ | ❌ | ❌ |
| CulturaX-Ru-Edu (Ru) | ❌ | ✅ | ❌ |
| Wiki (En+Ru) | ✅ | ✅ | ✅ |
| ArXiv (En) | ✅ | ✅ | ✅ |
| Book (En+Ru) | ✅ | ✅ | ✅ |
| Code | ✅ | ✅ | ✅ |
| StackExchange (En+Ru) | ✅ | ✅ | ✅ |
| Social (Ru) | ✅ | ✅ | ✅ |
| **Total Tokens** | 1.7T | 250B | 50B |
### Context Length
In the first stage, the model was trained with a context length of `1,024`.
In the second and third stages, it was extended to `8,192`.
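The extended context is reflected in the released config; a small sketch, assuming the standard ModernBERT config field `max_position_embeddings`:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "deepvk/RuModernBERT-base"
config = AutoConfig.from_pretrained(model_id)
# Maximum sequence length after context extension (expected: 8192).
print(config.max_position_embeddings)

# Long inputs should be truncated to the model's maximum length.
tokenizer = AutoTokenizer.from_pretrained(model_id)
batch = tokenizer(
    "очень длинный текст " * 5000,
    truncation=True,
    max_length=config.max_position_embeddings,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # at most (1, 8192)
```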
## Evaluation
To evaluate the model, we measure quality on the [`encodechka`](https://github.com/avidale/encodechka) and [`Russian Super Glue (RSG)`](https://russiansuperglue.com/) benchmarks.
For RSG, we perform a grid search for optimal hyperparameters and report metrics from the **dev** split.
For a fair comparison, we evaluate RuModernBERT only against raw encoders that were not trained on retrieval or sentence-embedding tasks.
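The exact fine-tuning setup is not reported here; the sketch below is only a hypothetical starting point for an RSG-style run, and the hyperparameter grid and `num_labels` value are illustrative assumptions:

```python
from itertools import product

from transformers import AutoModelForSequenceClassification

model_id = "deepvk/RuModernBERT-base"

# Illustrative grid; the values behind the reported RSG numbers are not given in this card.
for lr, batch_size in product([1e-5, 2e-5, 3e-5], [16, 32]):
    # Fresh classification head per run, e.g. for a two-class RSG task such as TERRa.
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
    # Fine-tune on the task's train split with (lr, batch_size), evaluate on dev,
    # and keep the configuration with the best dev score.
```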
### Russian Super Glue
<img src="./rsg.jpg">
| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|-------------------------------------------------------------------------------:|:---------:|:------:|:-------:|:-----:|:-------:|:-------:|:-------:|:---------:|
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill) | 0.433 | 0.56 | 0.625 | 0.590 | 0.943 | 0.569 | 0.726 | 0.635 |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base) | 0.450 | 0.61 | 0.722 | 0.704 | 0.948 | 0.578 | **0.760** | 0.682 |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base) | 0.491 | 0.61 | 0.663 | 0.769 | 0.962 | 0.574 | 0.678 | 0.678 |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) | 0.555 | **0.64** | 0.746 | 0.593 | 0.930 | 0.574 | 0.743 | 0.683 |
| deepvk/RuModernBERT-base [this] | **0.556** | 0.61 | **0.857** | **0.818** | **0.977** | **0.583** | 0.758 | **0.737** |
### Encodechka
| | Model Size | STS-B | Paraphraser | XNLI | Sentiment | Toxicity | Inappropriateness | Intents | IntentsX | FactRu | RuDReC | Avg. S | Avg. S+W |
|------------------------------------------------------------------------------------:|:----------:|:--------:|:-----------:|:--------:|:---------:|:--------:|:-----------------:|:--------:|:--------:|:--------:|:--------:|:----------:|:---------:|
| [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny) | 11.9M | 0.66 | 0.53 | **0.40** | 0.71 | 0.89 | 0.68 | 0.70 | **0.58** | 0.24 | 0.34 | 0.645 | 0.575 |
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill) | 81.5M | **0.70** | **0.57** | 0.38 | **0.77** | **0.98** | 0.79 | 0.77 | 0.36 | 0.36 | **0.44** | 0.665 | **0.612** |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base) | 124M | 0.68 | 0.54 | 0.38 | 0.76 | **0.98** | **0.80** | **0.78** | 0.29 | 0.29 | 0.40 | 0.653 | 0.591 |
| [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) | 150M | 0.50 | 0.29 | 0.36 | 0.64 | 0.79 | 0.62 | 0.59 | 0.10 | 0.22 | 0.20 | 0.486 | 0.431 |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base) | 178M | 0.67 | 0.53 | 0.39 | **0.77** | **0.98** | 0.78 | 0.77 | 0.38 | 🥴 | 🥴 | 0.659 | 🥴 |
| [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) | 180M | 0.63 | 0.50 | 0.38 | 0.73 | 0.94 | 0.74 | 0.74 | 0.31 | 🥴 | 🥴 | 0.621 | 🥴 |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) | 35M | 0.64 | 0.50 | 0.36 | 0.72 | 0.95 | 0.73 | 0.72 | 0.47 | 0.28 | 0.26 | 0.636 | 0.563 |
| deepvk/RuModernBERT-base [this] | 150M | 0.67 | 0.54 | 0.35 | 0.75 | 0.97 | 0.76 | 0.76 | **0.58** | **0.37** | 0.36 | **0.673** | 0.611 |
## Citation
```bibtex
@misc{deepvk2025rumodernbert,
    title={RuModernBERT: Modernized BERT for Russian},
    author={Spirin, Egor and Malashenko, Boris and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/rumodernbert-base},
    publisher={Hugging Face},
    year={2025},
}
```