---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/cultura_ru_edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb
language:
- ru
- en
pipeline_tag: fill-mask
---

# RuModernBERT-base

The Russian version of the modernized bidirectional encoder-only Transformer model, [ModernBERT](https://arxiv.org/abs/2412.13663).
RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, English, and code with a context length of up to 8,192 tokens, using data from the internet, books, scientific sources, and social media.

|                                                                                | Model Size | Hidden Dim | Num Layers | Vocab Size | Context Length | Task      |
|-------------------------------------------------------------------------------:|:----------:|:----------:|:----------:|:----------:|:--------------:|:---------:|
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)  | 35M        | 384        | 12         | 50368      | 8192           | Masked LM |
| deepvk/RuModernBERT-base [this]                                                | 150M       | 768        | 22         | 50368      | 8192           | Masked LM |

## Usage

Don't forget to update `transformers` to a version that supports ModernBERT, and install `flash-attn` if your GPU supports it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Prepare model.
# Drop `attn_implementation="flash_attention_2"` if flash-attn is not installed
# or your GPU does not support it.
model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
model = model.eval()

# Prepare input: "Limoncello is a liqueur made from [MASK]."
text = "Лимончелло это настойка из [MASK]."
inputs = tokenizer(text, return_tensors="pt")
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

# Show prediction
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: лимона
```
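
The same check can be done with the high-level `pipeline` API (a minimal sketch; the exact scores you see may differ slightly across `transformers` versions):

```python
from transformers import pipeline

# The fill-mask pipeline wraps the tokenizer and model shown above.
fill_mask = pipeline("fill-mask", model="deepvk/RuModernBERT-base")

# Each prediction is a dict with the decoded token and its probability.
for prediction in fill_mask("Лимончелло это настойка из [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```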

## Training Details

This is the base version with 150 million parameters and the same configuration as [`ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base).
The crucial difference lies in the data we used to pre-train this model.

### Tokenizer

We trained a new tokenizer following the original configuration: we kept the vocabulary size and added the same special tokens.
The tokenizer was trained on a mixture of Russian and English data from FineWeb.
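
To double-check what the tokenizer provides, you can inspect it directly (a small sketch; the printed values should match the vocabulary size and special tokens described above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-base")

print(len(tokenizer))                # vocabulary size (50368 per the table above)
print(tokenizer.mask_token)          # the mask token used in fill-mask prompts
print(tokenizer.all_special_tokens)  # special tokens shared with ModernBERT
```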

### Dataset

Pre-training includes three main stages: massive pre-training, context extension, and cooldown.
Unlike the original model, we did not use the same data across all stages; for the second and third stages, we switched to cleaner data sources.

| Data Source           | Stage 1 | Stage 2 | Stage 3 |
|----------------------:|:-------:|:-------:|:-------:|
| FineWeb (En+Ru)       | ✅      | ❌      | ❌      |
| CulturaX-Ru-Edu (Ru)  | ❌      | ✅      | ❌      |
| Wiki (En+Ru)          | ✅      | ✅      | ✅      |
| ArXiv (En)            | ✅      | ✅      | ✅      |
| Book (En+Ru)          | ✅      | ✅      | ✅      |
| Code                  | ✅      | ✅      | ✅      |
| StackExchange (En+Ru) | ✅      | ✅      | ✅      |
| Social (Ru)           | ✅      | ✅      | ✅      |
| **Total Tokens**      | 1.7T    | 250B    | 50B     |

### Context Length

In the first stage, the model was trained with a context length of `1,024` tokens.
In the second and third stages, it was extended to `8,192` tokens.
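
To confirm the extended window on the released checkpoint, you can read it off the model config (a minimal sketch; the field name follows the Hugging Face ModernBERT configuration):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepvk/RuModernBERT-base")
print(config.max_position_embeddings)  # expected: 8192
```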

## Evaluation

To evaluate the model, we measure quality on the [`encodechka`](https://github.com/avidale/encodechka) and [`Russian Super Glue (RSG)`](https://russiansuperglue.com/) benchmarks.
For RSG, we perform a grid search over hyperparameters and report metrics from the **dev** split.
For a fair comparison, we compare RuModernBERT only with raw encoders that were not trained on retrieval or sentence-embedding tasks.
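
Since these are raw encoders, every benchmark task is solved by fine-tuning the model with a task-specific head. A minimal sketch of how such a setup starts (the `num_labels=2` choice is illustrative; actual label counts and hyperparameters are per-task and come from the grid search mentioned above):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach a freshly initialized classification head on top of the encoder;
# num_labels depends on the task (e.g., 2 for a binary task like DaNetQA).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
```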

### Russian Super Glue

<img src="./rsg.jpg">

| Model                                                                          | RCB       | PARus    | MuSeRC    | TERRa     | RUSSE     | RWSD      | DaNetQA   | Score     |
|-------------------------------------------------------------------------------:|:---------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill)  | 0.433     | 0.56     | 0.625     | 0.590     | 0.943     | 0.569     | 0.726     | 0.635     |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base)        | 0.450     | 0.61     | 0.722     | 0.704     | 0.948     | 0.578     | **0.760** | 0.682     |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)        | 0.491     | 0.61     | 0.663     | 0.769     | 0.962     | 0.574     | 0.678     | 0.678     |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)  | 0.555     | **0.64** | 0.746     | 0.593     | 0.930     | 0.574     | 0.743     | 0.683     |
| deepvk/RuModernBERT-base [this]                                                | **0.556** | 0.61     | **0.857** | **0.818** | **0.977** | **0.583** | 0.758     | **0.737** |

### Encodechka

|                                                                                      | Model Size | STS-B    | Paraphraser | XNLI     | Sentiment | Toxicity | Inappropriateness | Intents  | IntentsX | FactRu   | RuDReC   | Avg. S    | Avg. S+W  |
|-------------------------------------------------------------------------------------:|:----------:|:--------:|:-----------:|:--------:|:---------:|:--------:|:-----------------:|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|
| [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny)          | 11.9M      | 0.66     | 0.53        | **0.40** | 0.71      | 0.89     | 0.68              | 0.70     | **0.58** | 0.24     | 0.34     | 0.645     | 0.575     |
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill)        | 81.5M      | **0.70** | **0.57**    | 0.38     | **0.77**  | **0.98** | 0.79              | 0.77     | 0.36     | 0.36     | **0.44** | 0.665     | **0.612** |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base)              | 124M       | 0.68     | 0.54        | 0.38     | 0.76      | **0.98** | **0.80**          | **0.78** | 0.29     | 0.29     | 0.40     | 0.653     | 0.591     |
| [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)    | 150M       | 0.50     | 0.29        | 0.36     | 0.64      | 0.79     | 0.62              | 0.59     | 0.10     | 0.22     | 0.20     | 0.486     | 0.431     |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)              | 178M       | 0.67     | 0.53        | 0.39     | **0.77**  | **0.98** | 0.78              | 0.77     | 0.38     | 🥴       | 🥴       | 0.659     | 🥴        |
| [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased)  | 180M       | 0.63     | 0.50        | 0.38     | 0.73      | 0.94     | 0.74              | 0.74     | 0.31     | 🥴       | 🥴       | 0.621     | 🥴        |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)        | 35M        | 0.64     | 0.50        | 0.36     | 0.72      | 0.95     | 0.73              | 0.72     | 0.47     | 0.28     | 0.26     | 0.636     | 0.563     |
| deepvk/RuModernBERT-base [this]                                                      | 150M       | 0.67     | 0.54        | 0.35     | 0.75      | 0.97     | 0.76              | 0.76     | **0.58** | **0.37** | 0.36     | **0.673** | 0.611     |

## Citation

```
@misc{deepvk2025rumodernbert,
    title={RuModernBERT: Modernized BERT for Russian},
    author={Spirin, Egor and Malashenko, Boris and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/rumodernbert-base},
    publisher={Hugging Face},
    year={2025},
}
```