---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/cultura_ru_edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb
language:
- ru
- en
pipeline_tag: fill-mask
---

# RuModernBERT-base

The Russian version of the modernized bidirectional encoder-only Transformer model, [ModernBERT](https://arxiv.org/abs/2412.13663).
RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, English, and code data with a context length of up to 8,192 tokens, using data from the internet, books, scientific sources, and social media.

|                                                                               | Model Size | Hidden Dim | Num Layers | Vocab Size | Context Length |    Task   |
|------------------------------------------------------------------------------:|:----------:|:----------:|:----------:|:----------:|:--------------:|:---------:|
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) |     35M    |     384    |     12     |    50368   |      8192      | Masked LM |
|                                               deepvk/RuModernBERT-base [this] |    150M    |     768    |     22     |    50368   |      8192      | Masked LM |


## Usage

Don't forget to update `transformers` and install `flash-attn` if your GPU supports it.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Prepare model
model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
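# Drop attn_implementation="flash_attention_2" if flash-attn is not installed or no CUDA GPU is available.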
model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
model = model.eval()

# Prepare input
text = "Лимончелло это настойка из [MASK]."
inputs = tokenizer(text, return_tensors="pt")
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

# Make prediction
outputs = model(**inputs)

# Show prediction
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  лимона
```
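
The same check can also be run through the higher-level `pipeline` API. This is a minimal sketch using the default attention implementation; the top prediction should match the snippet above.

```python
from transformers import pipeline

# Fill-mask pipeline for RuModernBERT-base
fill_mask = pipeline("fill-mask", model="deepvk/RuModernBERT-base")

# Print the top candidates for the masked token with their scores
for candidate in fill_mask("Лимончелло это настойка из [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```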

## Training Details

This is the base version with 150 million parameters and the same configuration as in [`ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base).
The crucial difference lies in the data we used to pre-train this model.
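
As a quick sanity check, the configuration can be inspected directly; the printed values should match the table at the top of this card.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepvk/RuModernBERT-base")

# Expected from the table above: 768 hidden dim, 22 layers, 50368 vocab, 8192 context length
print(config.hidden_size, config.num_hidden_layers, config.vocab_size, config.max_position_embeddings)
```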

### Tokenizer

We trained a new tokenizer following the original configuration.
We maintained the size of the vocabulary and added the same special tokens.
The tokenizer was trained on a mixture of Russian and English from FineWeb.
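
A short sketch for inspecting the resulting tokenizer; the printed values should line up with the vocabulary size and special tokens described above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-base")

print(tokenizer.vocab_size)  # compare with the 50368 vocabulary size listed in the table above
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.mask_token, tokenizer.pad_token)
print(tokenizer.tokenize("Лимончелло это настойка из лимона."))
```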

### Dataset

Pre-training includes three main stages: massive pre-training, context extension, and cooldown.
Unlike the original model, we did not use the same data for all stages.
For the second and third stages, we used cleaner data sources.

|           Data Source | Stage 1  | Stage 2 | Stage 3  |
|----------------------:|:--------:|:-------:|:--------:|
|       FineWeb (En+Ru) |    ✅    |    ❌    |    ❌    |
|  CulturaX-Ru-Edu (Ru) |    ❌    |    ✅    |    ❌    |
|          Wiki (En+Ru) |    ✅    |    ✅    |    ✅    |
|            ArXiv (En) |    ✅    |    ✅    |    ✅    |
|          Book (En+Ru) |    ✅    |    ✅    |    ✅    |
|                  Code |    ✅    |    ✅    |    ✅    |
| StackExchange (En+Ru) |    ✅    |    ✅    |    ✅    |
|           Social (Ru) |    ✅    |    ✅    |    ✅    |
|      **Total Tokens** |   1.7T   |   250B  |    50B   |


### Context length

In the first stage, the model was trained with a context length of `1,024`.
In the second and third stages, it was extended to `8,192`.
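
In practice this means long inputs can be encoded without chunking, up to the full window. A minimal sketch, where `long_text` is a placeholder for your own document:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

long_text = "..."  # placeholder for a document of up to ~8k tokens
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.logits.shape)
```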

## Evaluation

To evaluate the model, we measure quality on the [`encodechka`](https://github.com/avidale/encodechka) and [`Russian Super Glue (RSG)`](https://russiansuperglue.com/) benchmarks.
For RSG, we perform a grid search for optimal hyperparameters and report metrics from the **dev** split.

For a fair comparison, we compare the RuModernBERT model only with raw encoders that were not trained on retrieval or sentence embedding tasks.
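
For reference, a single RSG task can be fine-tuned roughly as follows. This is a minimal sketch rather than our exact training script: the `RussianNLP/russian_super_glue` dataset id and the TERRa column names are assumptions, the hyperparameters shown are just one point of the grid, and the grid search itself is omitted.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# TERRa is one of the RSG tasks; premise/hypothesis/label follow that dataset's schema.
dataset = load_dataset("RussianNLP/russian_super_glue", "terra")

def preprocess(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=512)

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="rumodernbert-base-terra",
    learning_rate=2e-5,                  # one point of the hyperparameter grid
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],  # metrics are reported on the dev split
    processing_class=tokenizer,
)
trainer.train()
```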

### Russian Super Glue

<img src="./rsg.jpg">

| Model                                                                          | RCB       |  PARus | MuSeRC  | TERRa | RUSSE   | RWSD    | DaNetQA | Score     |
|-------------------------------------------------------------------------------:|:---------:|:------:|:-------:|:-----:|:-------:|:-------:|:-------:|:---------:|
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill)  | 0.433     |  0.56  | 0.625   | 0.590 | 0.943   | 0.569   | 0.726   | 0.635     |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base)        | 0.450     |  0.61  | 0.722   | 0.704 | 0.948   | 0.578   | **0.760**   | 0.682     |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)        | 0.491     |  0.61  | 0.663   | 0.769 | 0.962   | 0.574   | 0.678   | 0.678     |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)  | 0.555     |  **0.64**  | 0.746   | 0.593 | 0.930   | 0.574   | 0.743   | 0.683     |
| deepvk/RuModernBERT-base [this]                                                | **0.556** |  0.61  | **0.857**   | **0.818** | **0.977**   | **0.583**   | 0.758   | **0.737**     |

### Encodechka

|                                                                                     | Model Size |   STS-B  | Paraphraser |   XNLI   | Sentiment | Toxicity | Inappropriateness |  Intents | IntentsX |  FactRu  |  RuDReC  |   Avg. S   |  Avg. S+W |
|------------------------------------------------------------------------------------:|:----------:|:--------:|:-----------:|:--------:|:---------:|:--------:|:-----------------:|:--------:|:--------:|:--------:|:--------:|:----------:|:---------:|
|         [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny) |    11.9M   |   0.66   |     0.53    | **0.40** |    0.71   |   0.89   |        0.68       |   0.70   | **0.58** |   0.24   |   0.34   |    0.645   |   0.575   |
|       [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill) |    81.5M   | **0.70** |   **0.57**  |   0.38   |  **0.77** | **0.98** |        0.79       |   0.77   |   0.36   |   0.36   | **0.44** |    0.665   | **0.612** |
|             [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base) |    124M    |   0.68   |     0.54    |   0.38   |    0.76   | **0.98** |      **0.80**     | **0.78** |   0.29   |   0.29   |   0.40   |    0.653   |   0.591   |
|   [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) |    150M    |   0.50   |     0.29    |   0.36   |    0.64   |   0.79   |        0.62       |   0.59   |   0.10   |   0.22   |   0.20   |    0.486   |   0.431   |
|             [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base) |    178M    |   0.67   |     0.53    |   0.39   |  **0.77** | **0.98** |        0.78       |   0.77   |   0.38   |    🥴    |    🥴    |    0.659   |    🥴    |
| [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased) |    180M    |   0.63   |     0.50    |   0.38   |    0.73   |   0.94   |        0.74       |   0.74   |   0.31   |    🥴    |    🥴    |    0.621   |    🥴    |
|       [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) |     35M    |   0.64   |     0.50    |   0.36   |    0.72   |   0.95   |        0.73       |   0.72   |   0.47   |   0.28   |   0.26   |    0.636   |   0.563   |
|                                                     deepvk/RuModernBERT-base [this] |    150M    |   0.67   |     0.54    |   0.35   |    0.75   |   0.97   |        0.76       |   0.76   | **0.58** | **0.37** |   0.36   | **0.673**  |   0.611   |

## Citation

```
@misc{deepvk2025rumodernbert,
    title={RuModernBERT: Modernized BERT for Russian},
    author={Spirin, Egor and Malashenko, Boris and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/rumodernbert-base},
    publisher={Hugging Face},
    year={2025},
}
```