TabiBERT
Model Summary
TabiBERT is a modernized encoder-only Transformer (BERT-style) based on the ModernBERT-base architecture. It is pre-trained on 1 trillion tokens drawn from a diverse mix of Turkish, English, code, and math, with a native context length of up to 8,192 tokens.
TabiBERT inherits ModernBERT’s architectural improvements, such as:
- Rotary Positional Embeddings (RoPE) for long-context support.
- Local-Global Alternating Attention for efficiency on long inputs.
- Unpadding and Flash Attention for efficient inference.
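These settings can be read from the released configuration. Below is a minimal sketch, assuming the checkpoint exposes the standard ModernBERT config fields (`global_rope_theta`, `local_attention`, `global_attn_every_n_layers`); the `getattr` defaults guard against field names that may differ in the released checkpoint:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("boun-tabilab/TabiBERT")

# Field names follow transformers' ModernBERT config; getattr guards against
# fields that may be named differently in this checkpoint.
print("max position embeddings:", getattr(config, "max_position_embeddings", None))
print("global RoPE theta:", getattr(config, "global_rope_theta", None))
print("local attention window:", getattr(config, "local_attention", None))
print("global attention every N layers:", getattr(config, "global_attn_every_n_layers", None))
```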
These properties make TabiBERT particularly suitable for:
- Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
- Multilingual text understanding (Turkish-English).
- Code retrieval and representation learning.
- Mathematical and symbolic reasoning.
- Long-context understanding such as document classification, retrieval, and semantic search.
TabiBERT is built by Tabilab in collaboration with VNGRS.
Usage
You can use TabiBERT directly with the transformers library (v4.48.0+):
```bash
pip install -U "transformers>=4.48.0"
```
Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.
⚠️ If your GPU supports it, we recommend using TabiBERT with Flash Attention 2 for the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:
```bash
pip install flash-attn
```
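For example, a minimal sketch of enabling the Flash Attention 2 backend at load time (the `attn_implementation` argument is standard transformers API; `torch.bfloat16` support on your GPU is assumed):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Requires a Flash Attention 2 compatible GPU and the flash-attn package.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```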
Example usage with AutoModelForMaskedLM:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# "The largest planet in the [MASK] System is Jupiter."
text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Güneş  ("Sun")
```
Example with pipeline:
```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
# "[MASK] is the capital of the Republic of Turkey."
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir."))
```
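Because the card also targets retrieval and semantic search, here is a minimal sketch of extracting sentence embeddings by mean-pooling the last hidden states (a common recipe, not one prescribed by this card; the example sentences are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

sentences = [
    "Ankara, Türkiye'nin başkentidir.",          # "Ankara is the capital of Turkey."
    "İstanbul, Türkiye'nin en büyük şehridir.",  # "Istanbul is Turkey's largest city."
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```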
Pre-training Data
TabiBERT has been pre-trained on 86 billion tokens of diverse data, primarily:
- A large-scale Turkish corpus covering literature, news, social media, Wikipedia, and academic texts.
- English text, **code with English commentary**, and math problems in English, which together make up about 13% of the tokens.
Training
- Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
- Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
- Data: 86 billion tokens from a combined corpus: primarily Turkish, plus English text, code with English commentary, and math in English (~13% non-Turkish).
- Optimizer: StableAdamW with a trapezoidal learning-rate schedule and 1-sqrt decay (sketched below).
- Hardware: Trained on 8x H100 GPUs.
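The exact schedule parameters are not given in this card, but a minimal sketch of a trapezoidal (warmup-stable-decay) schedule with a 1-sqrt decay phase, the schedule family ModernBERT-style training uses, might look as follows; the exact decay form TabiBERT used is an assumption, and all step counts and the peak learning rate are made-up placeholders:

```python
import math

def trapezoidal_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Trapezoidal LR: linear warmup, constant plateau, then 1-sqrt decay.

    The 1-sqrt decay form (1 - sqrt(progress)) is an assumption based on the
    ModernBERT-style schedule; TabiBERT's exact variant may differ.
    """
    if step < warmup_steps:                      # linear warmup
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:       # constant plateau
        return peak_lr
    # 1-sqrt decay over the final decay_steps
    progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
    return peak_lr * max(0.0, 1.0 - math.sqrt(min(1.0, progress)))

# Example with placeholder step counts and peak learning rate
for s in (0, 1_000, 50_000, 95_000, 100_000):
    print(s, round(trapezoidal_lr(s, 8e-4, 2_000, 90_000, 10_000), 6))
```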
Evaluation
TabiBERT was comprehensively evaluated on TabiBench, a benchmark consisting of 28 datasets spanning 8 task categories. The model achieves state-of-the-art performance among Turkish models, with a total average score of 77.58, surpassing the previous best Turkish model by 1.62 points.
Key Highlights
- State-of-the-art performance: TabiBERT outperforms all monolingual Turkish baselines across the evaluation suite
- Largest improvement in QA: Achieves an F1 score of 69.71, outperforming the next best Turkish model by 9.55 points (16% relative improvement)
- Leading performance in 5 out of 8 task categories, including code retrieval and information retrieval
- Strong long-context capabilities: Native support for up to 8,192 tokens, providing advantages on longer sequences
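To illustrate the native 8,192-token context, here is a minimal sketch of encoding a long document at full length (the document text is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("boun-tabilab/TabiBERT")

# Placeholder long document; replace with a real document.
long_document = "Uzun bir belge metni. " * 3000  # "A long document text."

inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=8192,  # TabiBERT's native maximum context length
    return_tensors="pt",
)
print("encoded length:", inputs["input_ids"].shape[1])
# The resulting batch can be passed to the model exactly as in the usage examples above.
```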
Benchmark: TabiBench
TabiBench is a comprehensive benchmark specifically designed for Turkish NLP, consisting of 28 datasets across 8 task types. The benchmark includes both existing Turkish NLP datasets and newly created/translated datasets for code retrieval and academic domain tasks.
Benchmark Collection: TabiBench on HuggingFace
Overall Evaluation Results
Comparison of downstream task performance across all evaluated models.
For each column, the highest score among the Turkish models (excluding the multilingual mmBERT) is shown in bold. The evaluation metric used for each task type is shown in the column header.
| Model | # of params (M) | Text Clf (F1) | Token Clf (F1) | STS (Pearson) | NLI (F1) | QA (F1) | Academic (F1) | Retrieval (NDCG@10) | Code Retrieval (NDCG@10) | Total Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| TurkishBERTweet | 163 | 79.71 | 92.02 | 75.86 | 79.10 | 38.13 | 63.12 | 68.40 | 43.49 | 67.48 |
| YTU-BERT | 111 | **84.25** | 93.60 | 84.68 | 84.16 | 31.50 | 71.78 | 74.29 | 53.80 | 72.26 |
| BERTurk | 110 | 83.42 | **93.67** | **85.33** | 84.33 | 60.16 | 71.40 | 74.84 | 54.54 | 75.96 |
| TabiBERT | 149 | 83.44 | 93.42 | 84.74 | **84.51** | **69.71** | **72.44** | **75.44** | **56.95** | **77.58** |
| mmBERT | 307 | 82.54 | 93.81 | 87.05 | 84.38 | 71.47 | 72.65 | 76.20 | 66.02 | 79.26 |
Evaluation Methodology
Systematic hyperparameter tuning was performed for all model-task pairs with the following search space:
| Parameter | Values |
|---|---|
| Learning Rate | 5e-6, 1e-5, 2e-5, 3e-5 |
| Weight Decay | 1e-5, 1e-6 |
| Batch Size | 16, 32 |
| Epochs | Up to 10, with early stopping |
For each task category, a single score is reported by computing a weighted average across all datasets, where each dataset's weight is proportional to its test set size. This ensures that larger, more representative datasets have corresponding influence on overall results (test set sizes range from 150 to 35,000 examples).
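A minimal sketch of this weighted averaging; the dataset names, scores, and test-set sizes below are illustrative placeholders, not TabiBench values:

```python
def weighted_category_score(results):
    """Weighted average of per-dataset scores, weighted by test-set size."""
    total = sum(size for _, size in results.values())
    return sum(score * size for score, size in results.values()) / total

# Placeholder (score, test_set_size) pairs for one task category.
qa_results = {
    "dataset_a": (70.2, 5_000),
    "dataset_b": (65.8, 1_200),
    "dataset_c": (72.1, 150),
}
print(round(weighted_category_score(qa_results), 2))
```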
Comparison with Baselines
Monolingual Turkish Models
TabiBERT was compared against three established Turkish BERT models:
- BERTurk: A widely used Turkish monolingual encoder, pre-trained on web and specialized corpora
- YTU-BERT: An uncased Turkish BERT, pre-trained on a large Turkish corpus (web, Wikipedia, books)
- TurkishBERTweet: An uncased Turkish BERT, pre-trained at scale on social media text
Result: TabiBERT outperforms all monolingual Turkish models with a total average score of 77.58, surpassing BERTurk (previous best) by 1.62 points.
Multilingual Comparison
TabiBERT was also compared against mmBERT, a multilingual ModernBERT-based encoder:
- mmBERT: 307M parameters, pre-trained on 1,800 languages for 3T tokens
- TabiBERT: 149M parameters, pre-trained on a primarily Turkish corpus for 1T tokens
Result: While mmBERT achieves a higher total average score (79.26), TabiBERT offers advantages for Turkish-focused applications:
- Lower memory requirements: Smaller model size (149M vs 307M parameters)
- Better tokenization efficiency: roughly 29% more effective context length for Turkish text (see the sketch after this list)
- Optimized for Turkish: Specialized monolingual model vs. general multilingual model
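A minimal sketch of how such a tokenizer-efficiency comparison can be run. Since this card does not give the exact mmBERT checkpoint id, the widely available `bert-base-multilingual-cased` tokenizer is used here as a multilingual stand-in, so the printed ratio will not reproduce the 29% figure exactly:

```python
from transformers import AutoTokenizer

# Turkish sample text: "Ankara is the capital of Turkey and Istanbul is its most
# populous city. In long Turkish texts, subword efficiency directly affects context length."
turkish_text = (
    "Türkiye'nin başkenti Ankara'dır ve en kalabalık şehri İstanbul'dur. "
    "Uzun Türkçe metinlerde alt sözcük verimliliği bağlam uzunluğunu doğrudan etkiler."
)

tabi_tok = AutoTokenizer.from_pretrained("boun-tabilab/TabiBERT")
# Multilingual stand-in; the card's comparison is against mmBERT specifically.
multi_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

n_tabi = len(tabi_tok.tokenize(turkish_text))
n_multi = len(multi_tok.tokenize(turkish_text))

print(f"TabiBERT tokens: {n_tabi}, multilingual tokens: {n_multi}")
print(f"relative efficiency: {n_multi / n_tabi:.2f}x")
```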
Reproducibility
All evaluation datasets are publicly available on HuggingFace under the TabiBench collection, to facilitate future research and comparisons.
Limitations
- TabiBERT was trained mainly on Turkish, with additional English, code, and math. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
- As with any large-scale model, it may inherit biases from training data.
- While the model can handle up to 8k tokens, inference on very long sequences may be slower.
- The model is still under evaluation; validate results before deploying it in critical applications.
License
Released under the Apache 2.0 license.
Citation
Citation is in progress.