---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/cultura_ru_edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb
language:
- ru
- en
pipeline_tag: fill-mask
---

# RuModernBERT-base

The Russian version of the modernized bidirectional encoder-only Transformer model, [ModernBERT](https://arxiv.org/abs/2412.13663).
RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, English, and code data with a context length of up to 8,192 tokens, using data from the internet, books, scientific sources, and social media.

|                                                                                | Model Size | Hidden Dim | Num Layers | Vocab Size | Context Length | Task      |
|-------------------------------------------------------------------------------:|:----------:|:----------:|:----------:|:----------:|:--------------:|:---------:|
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)  | 35M        | 384        | 12         | 50368      | 8192           | Masked LM |
| deepvk/RuModernBERT-base [this]                                                 | 150M       | 768        | 22         | 50368      | 8192           | Masked LM |

## Usage

Don't forget to update `transformers` and install `flash-attn` if your GPU supports it.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Prepare model
model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
model = model.eval()

# Prepare input
text = "Лимончелло это настойка из [MASK]."
inputs = tokenizer(text, return_tensors="pt")
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

# Make prediction
outputs = model(**inputs)

# Show prediction
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: лимона
```

## Training Details

This is the base version with 150 million parameters and the same configuration as in [`ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base).
The crucial difference lies in the data we used to pre-train this model.

### Tokenizer

We trained a new tokenizer following the original configuration.
We kept the vocabulary size and added the same special tokens.
The tokenizer was trained on a mixture of Russian and English data from FineWeb.

### Dataset

Pre-training includes three main stages: massive pre-training, context extension, and cooldown.
Unlike the original model, we did not use the same data for all stages.
For the second and third stages, we used cleaner data sources.

| Data Source            | Stage 1 | Stage 2 | Stage 3 |
|-----------------------:|:-------:|:-------:|:-------:|
| FineWeb (En+Ru)        |   ✅    |   ❌    |   ❌    |
| CulturaX-Ru-Edu (Ru)   |   ❌    |   ✅    |   ❌    |
| Wiki (En+Ru)           |   ✅    |   ✅    |   ✅    |
| ArXiv (En)             |   ✅    |   ✅    |   ✅    |
| Book (En+Ru)           |   ✅    |   ✅    |   ✅    |
| Code                   |   ✅    |   ✅    |   ✅    |
| StackExchange (En+Ru)  |   ✅    |   ✅    |   ✅    |
| Social (Ru)            |   ✅    |   ✅    |   ✅    |
| **Total Tokens**       |  1.7T   |  250B   |   50B   |

### Context length

In the first stage, the model was trained with a context length of `1,024` tokens.
In the second and third stages, it was extended to `8,192` tokens.

## Evaluation

To evaluate the model, we measure quality on the [`encodechka`](https://github.com/avidale/encodechka) and [`Russian Super Glue (RSG)`](https://russiansuperglue.com/) benchmarks.
For RSG, we perform a grid search for optimal hyperparameters and report metrics from the **dev** split.
For a fair comparison, we compare the RuModernBERT models only with raw encoders that were not trained on retrieval or sentence embedding tasks.
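
Since `encodechka` scores raw encoders through the sentence embeddings they produce, the sketch below shows one common way to obtain such embeddings from RuModernBERT: mean pooling of the last hidden states over non-padding tokens. This is an illustrative assumption, not the exact pooling or setup used for the benchmark numbers reported here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch: mean pooling over non-padding tokens.
# The benchmark pipelines may pool and normalize differently.
model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = ["Лимончелло это настойка из лимона.", "Как приготовить лимончелло?"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state              # [batch, seq_len, hidden]

mask = inputs["attention_mask"].unsqueeze(-1).float()       # [batch, seq_len, 1]
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # average over real tokens only
print(embeddings.shape)                                     # torch.Size([2, 768])
```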
### Russian Super Glue

| Model                                                                          | RCB       | PARus    | MuSeRC    | TERRa     | RUSSE     | RWSD      | DaNetQA   | Score     |
|-------------------------------------------------------------------------------:|:---------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill)  | 0.433     | 0.56     | 0.625     | 0.590     | 0.943     | 0.569     | 0.726     | 0.635     |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base)        | 0.450     | 0.61     | 0.722     | 0.704     | 0.948     | 0.578     | **0.760** | 0.682     |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)        | 0.491     | 0.61     | 0.663     | 0.769     | 0.962     | 0.574     | 0.678     | 0.678     |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)  | 0.555     | **0.64** | 0.746     | 0.593     | 0.930     | 0.574     | 0.743     | 0.683     |
| deepvk/RuModernBERT-base [this]                                                 | **0.556** | 0.61     | **0.857** | **0.818** | **0.977** | **0.583** | 0.758     | **0.737** |

### Encodechka

| Model                                                                                | Model Size | STS-B    | Paraphraser | XNLI     | Sentiment | Toxicity | Inappropriateness | Intents  | IntentsX | FactRu   | RuDReC   | Avg. S    | Avg. S+W  |
|-------------------------------------------------------------------------------------:|:----------:|:--------:|:-----------:|:--------:|:---------:|:--------:|:-----------------:|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|
| [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny)          | 11.9M      | 0.66     | 0.53        | **0.40** | 0.71      | 0.89     | 0.68              | 0.70     | **0.58** | 0.24     | 0.34     | 0.645     | 0.575     |
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill)        | 81.5M      | **0.70** | **0.57**    | 0.38     | **0.77**  | **0.98** | 0.79              | 0.77     | 0.36     | 0.36     | **0.44** | 0.665     | **0.612** |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base)              | 124M       | 0.68     | 0.54        | 0.38     | 0.76      | **0.98** | **0.80**          | **0.78** | 0.29     | 0.29     | 0.40     | 0.653     | 0.591     |
| [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)    | 150M       | 0.50     | 0.29        | 0.36     | 0.64      | 0.79     | 0.62              | 0.59     | 0.10     | 0.22     | 0.20     | 0.486     | 0.431     |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)              | 178M       | 0.67     | 0.53        | 0.39     | **0.77**  | **0.98** | 0.78              | 0.77     | 0.38     | 🥴       | 🥴       | 0.659     | 🥴        |
| [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased)  | 180M       | 0.63     | 0.50        | 0.38     | 0.73      | 0.94     | 0.74              | 0.74     | 0.31     | 🥴       | 🥴       | 0.621     | 🥴        |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)        | 35M        | 0.64     | 0.50        | 0.36     | 0.72      | 0.95     | 0.73              | 0.72     | 0.47     | 0.28     | 0.26     | 0.636     | 0.563     |
| deepvk/RuModernBERT-base [this]                                                       | 150M       | 0.67     | 0.54        | 0.35     | 0.75      | 0.97     | 0.76              | 0.76     | **0.58** | **0.37** | 0.36     | **0.673** | 0.611     |

## Citation

```
@misc{deepvk2025rumodernbert,
    title={RuModernBERT: Modernized BERT for Russian},
    author={Spirin, Egor and Malashenko, Boris and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/rumodernbert-base},
    publisher={Hugging Face},
    year={2025},
}
```