---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/cultura_ru_edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb
language:
- ru
- en
pipeline_tag: fill-mask
---

# RuModernBERT-base

The Russian version of the modernized bidirectional encoder-only Transformer model, [ModernBERT](https://arxiv.org/abs/2412.13663).
RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, English, and code with a context length of up to 8,192 tokens, using data from the internet, books, scientific sources, and social media.

|                                                                                | Model Size | Hidden Dim | Num Layers | Vocab Size | Context Length | Task      |
|-------------------------------------------------------------------------------:|:----------:|:----------:|:----------:|:----------:|:--------------:|:---------:|
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)  | 35M        | 384        | 12         | 50368      | 8192           | Masked LM |
| deepvk/RuModernBERT-base [this]                                                | 150M       | 768        | 22         | 50368      | 8192           | Masked LM |

## Usage

Don't forget to update `transformers` to a version that supports ModernBERT, and install `flash-attn` if your GPU supports it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Prepare model.
# Drop `attn_implementation="flash_attention_2"` if flash-attn is not installed
# or your GPU does not support it.
model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
model = model.eval()

# Prepare input: "Limoncello is a liqueur made from [MASK]."
text = "Лимончелло это настойка из [MASK]."
inputs = tokenizer(text, return_tensors="pt")
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

# Show prediction
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: лимона
```
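
The same check can be done with the high-level `pipeline` API (a minimal sketch; the exact scores you see may differ slightly across `transformers` versions):

```python
from transformers import pipeline

# The fill-mask pipeline wraps the tokenizer and model shown above.
fill_mask = pipeline("fill-mask", model="deepvk/RuModernBERT-base")

# Each prediction is a dict with the decoded token and its probability.
for prediction in fill_mask("Лимончелло это настойка из [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```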

## Training Details

This is the base version with 150 million parameters and the same configuration as [`ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base).
The crucial difference lies in the data we used to pre-train this model.

### Tokenizer

We trained a new tokenizer following the original configuration: we kept the vocabulary size and added the same special tokens.
The tokenizer was trained on a mixture of Russian and English data from FineWeb.
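
To double-check what the tokenizer provides, you can inspect it directly (a small sketch; the printed values should match the vocabulary size and special tokens described above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-base")

print(len(tokenizer))                # vocabulary size (50368 per the table above)
print(tokenizer.mask_token)          # the mask token used in fill-mask prompts
print(tokenizer.all_special_tokens)  # special tokens shared with ModernBERT
```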

### Dataset

Pre-training includes three main stages: massive pre-training, context extension, and cooldown.
Unlike the original model, we did not use the same data across all stages; for the second and third stages, we switched to cleaner data sources.

| Data Source           | Stage 1 | Stage 2 | Stage 3 |
|----------------------:|:-------:|:-------:|:-------:|
| FineWeb (En+Ru)       | ✅      | ❌      | ❌      |
| CulturaX-Ru-Edu (Ru)  | ❌      | ✅      | ❌      |
| Wiki (En+Ru)          | ✅      | ✅      | ✅      |
| ArXiv (En)            | ✅      | ✅      | ✅      |
| Book (En+Ru)          | ✅      | ✅      | ✅      |
| Code                  | ✅      | ✅      | ✅      |
| StackExchange (En+Ru) | ✅      | ✅      | ✅      |
| Social (Ru)           | ✅      | ✅      | ✅      |
| **Total Tokens**      | 1.7T    | 250B    | 50B     |

### Context Length

In the first stage, the model was trained with a context length of `1,024` tokens.
In the second and third stages, it was extended to `8,192` tokens.
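
To confirm the extended window on the released checkpoint, you can read it off the model config (a minimal sketch; the field name follows the Hugging Face ModernBERT configuration):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepvk/RuModernBERT-base")
print(config.max_position_embeddings)  # expected: 8192
```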

## Evaluation

To evaluate the model, we measure quality on the [`encodechka`](https://github.com/avidale/encodechka) and [`Russian Super Glue (RSG)`](https://russiansuperglue.com/) benchmarks.
For RSG, we perform a grid search over hyperparameters and report metrics from the **dev** split.
For a fair comparison, we compare RuModernBERT only with raw encoders that were not trained on retrieval or sentence-embedding tasks.
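
Since these are raw encoders, every benchmark task is solved by fine-tuning the model with a task-specific head. A minimal sketch of how such a setup starts (the `num_labels=2` choice is illustrative; actual label counts and hyperparameters are per-task and come from the grid search mentioned above):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "deepvk/RuModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach a freshly initialized classification head on top of the encoder;
# num_labels depends on the task (e.g., 2 for a binary task like DaNetQA).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
```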

### Russian Super Glue

<img src="./rsg.jpg">

| Model                                                                          | RCB       | PARus    | MuSeRC    | TERRa     | RUSSE     | RWSD      | DaNetQA   | Score     |
|-------------------------------------------------------------------------------:|:---------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill)  | 0.433     | 0.56     | 0.625     | 0.590     | 0.943     | 0.569     | 0.726     | 0.635     |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base)        | 0.450     | 0.61     | 0.722     | 0.704     | 0.948     | 0.578     | **0.760** | 0.682     |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)        | 0.491     | 0.61     | 0.663     | 0.769     | 0.962     | 0.574     | 0.678     | 0.678     |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)  | 0.555     | **0.64** | 0.746     | 0.593     | 0.930     | 0.574     | 0.743     | 0.683     |
| deepvk/RuModernBERT-base [this]                                                | **0.556** | 0.61     | **0.857** | **0.818** | **0.977** | **0.583** | 0.758     | **0.737** |

### Encodechka

|                                                                                      | Model Size | STS-B    | Paraphraser | XNLI     | Sentiment | Toxicity | Inappropriateness | Intents  | IntentsX | FactRu   | RuDReC   | Avg. S    | Avg. S+W  |
|-------------------------------------------------------------------------------------:|:----------:|:--------:|:-----------:|:--------:|:---------:|:--------:|:-----------------:|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|
| [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny)          | 11.9M      | 0.66     | 0.53        | **0.40** | 0.71      | 0.89     | 0.68              | 0.70     | **0.58** | 0.24     | 0.34     | 0.645     | 0.575     |
| [deepvk/deberta-v1-distill](https://huggingface.co/deepvk/deberta-v1-distill)        | 81.5M      | **0.70** | **0.57**    | 0.38     | **0.77**  | **0.98** | 0.79              | 0.77     | 0.36     | 0.36     | **0.44** | 0.665     | **0.612** |
| [deepvk/deberta-v1-base](https://huggingface.co/deepvk/deberta-v1-base)              | 124M       | 0.68     | 0.54        | 0.38     | 0.76      | **0.98** | **0.80**          | **0.78** | 0.29     | 0.29     | 0.40     | 0.653     | 0.591     |
| [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)    | 150M       | 0.50     | 0.29        | 0.36     | 0.64      | 0.79     | 0.62              | 0.59     | 0.10     | 0.22     | 0.20     | 0.486     | 0.431     |
| [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)              | 178M       | 0.67     | 0.53        | 0.39     | **0.77**  | **0.98** | 0.78              | 0.77     | 0.38     | 🥴       | 🥴       | 0.659     | 🥴        |
| [DeepPavlov/rubert-base-cased](https://huggingface.co/DeepPavlov/rubert-base-cased)  | 180M       | 0.63     | 0.50        | 0.38     | 0.73      | 0.94     | 0.74              | 0.74     | 0.31     | 🥴       | 🥴       | 0.621     | 🥴        |
| [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small)        | 35M        | 0.64     | 0.50        | 0.36     | 0.72      | 0.95     | 0.73              | 0.72     | 0.47     | 0.28     | 0.26     | 0.636     | 0.563     |
| deepvk/RuModernBERT-base [this]                                                      | 150M       | 0.67     | 0.54        | 0.35     | 0.75      | 0.97     | 0.76              | 0.76     | **0.58** | **0.37** | 0.36     | **0.673** | 0.611     |

## Citation

```
@misc{deepvk2025rumodernbert,
    title={RuModernBERT: Modernized BERT for Russian},
    author={Spirin, Egor and Malashenko, Boris and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/rumodernbert-base},
    publisher={Hugging Face},
    year={2025},
}
```