TabiBERT

Table of Contents

  1. Model Summary
  2. Usage
  3. Pre-training Data
  4. Training
  5. Evaluation
  6. Limitations
  7. License
  8. Citation

Model Summary

TabiBERT is a modernized encoder-only Transformer model (BERT-style) based on the ModernBERT-base architecture. TabiBERT is pre-trained for 1 trillion tokens on a diverse dataset spanning Turkish, English, code, and math, with a native context length of up to 8,192 tokens.

TabiBERT inherits ModernBERT’s architectural improvements, such as:

  • Rotary Positional Embeddings (RoPE) for long-context support.
  • Local-Global Alternating Attention for efficiency on long inputs.
  • Unpadding and Flash Attention for efficient inference.

This makes TabiBERT particularly suitable for:

  • Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
  • Multilingual text understanding (Turkish-English).
  • Code retrieval and representation learning.
  • Mathematical and symbolic reasoning.
  • Long-context understanding such as document classification, retrieval, and semantic search.

TabiBERT is built by Tabilab in collaboration with VNGRS.
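
For intuition, the rotary positional embeddings (RoPE) mentioned above can be sketched in a few lines of PyTorch. This is a generic illustration of the technique, not TabiBERT's exact implementation:

import torch

def rotate_half(x):
    # Split the last dimension in half and swap the halves with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, base=10000.0):
    # x: (seq_len, head_dim) query or key vectors; head_dim must be even.
    seq_len, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq_len, head_dim/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    # Each position is rotated by a position-dependent angle, so relative
    # offsets are encoded directly in query-key dot products.
    return x * cos + rotate_half(x) * sin

q = torch.randn(8, 64)        # 8 positions, head dimension 64
print(apply_rope(q).shape)    # torch.Size([8, 64])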


Usage

You can use TabiBERT directly with the transformers library (v4.48.0+):

pip install -U "transformers>=4.48.0"

Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.

⚠️ If your GPU supports it, we recommend running TabiBERT with Flash Attention 2 for the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

pip install flash-attn
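
With flash-attn installed and a compatible GPU, you can also request Flash Attention 2 explicitly when loading the model. This is a minimal sketch using standard transformers arguments (the bfloat16 dtype is our assumption, chosen because Flash Attention 2 expects half precision):

import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",  # needs the flash-attn package
    torch_dtype=torch.bfloat16,               # Flash Attention 2 expects fp16/bf16
)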

Example usage with AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)

masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(axis=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token:  Güneş

Example with pipeline:

from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")

# "[MASK] is the capital of the Republic of Turkey."
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir."))
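
Beyond masked-word prediction, the encoder can also produce sentence embeddings for retrieval or semantic search. The snippet below is a minimal sketch using mean pooling over the last hidden states; mean pooling is one common choice, not necessarily the setup used in the TabiBench retrieval evaluations:

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# "Ankara is the capital of Turkey." / "Istanbul is the most populous city."
sentences = ["Ankara Türkiye'nin başkentidir.", "İstanbul en kalabalık şehirdir."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")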

Pre-training Data

TabiBERT has been pre-trained on 86 billion tokens of diverse data, primarily:

  • A large-scale Turkish corpus covering literature, news, social media, Wikipedia, and academic texts.
  • English text, code with English commentary, and math problems in English, together making up about 13% of the tokens (the non-Turkish share).

Training

  • Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
  • Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
  • Data: 86 billion tokens from a union corpus (Turkish; plus English, code with English commentary, and math in English; ~13% non-Turkish).
  • Optimizer: StableAdamW with trapezoidal LR scheduling and 1-sqrt decay (see the sketch after this list).
  • Hardware: Trained on 8x H100 GPUs.
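
As a rough illustration of the schedule shape referenced in the optimizer bullet, here is a sketch with hypothetical step counts, not the exact training configuration:

def trapezoidal_lr(step, total_steps, warmup_steps, decay_steps):
    """Warmup-stable-decay ("trapezoidal") multiplier with a 1-sqrt decay tail."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # linear warmup
    if step < total_steps - decay_steps:
        return 1.0                                          # constant plateau at the peak LR
    progress = (step - (total_steps - decay_steps)) / decay_steps
    return max(0.0, 1.0 - progress ** 0.5)                  # 1-sqrt decay to zero

# Hypothetical run: 100k steps, 2k warmup, 10k decay
for step in (0, 1000, 2000, 50000, 95000, 100000):
    print(step, round(trapezoidal_lr(step, 100_000, 2_000, 10_000), 3))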

Evaluation

TabiBERT was comprehensively evaluated on TabiBench, a benchmark consisting of 28 datasets spanning 8 task categories. The model achieves state-of-the-art performance among Turkish models, with a total average score of 77.58, surpassing the previous best Turkish model by 1.62 points.

Key Highlights

  • State-of-the-art performance: TabiBERT outperforms all monolingual Turkish baselines across the evaluation suite
  • Largest improvement in QA: Achieves an F1 score of 69.71, outperforming the next best Turkish model by 9.55 points (16% relative improvement)
  • Leading performance in 5 out of 8 task categories: Including code retrieval and information retrieval
  • Strong long-context capabilities: Native support for up to 8,192 tokens, providing advantages on longer sequences

Benchmark: TabiBench

TabiBench is a comprehensive benchmark specifically designed for Turkish NLP, consisting of 28 datasets across 8 task types. The benchmark includes both existing Turkish NLP datasets and newly created/translated datasets for code retrieval and academic domain tasks.

Benchmark Collection: TabiBench on HuggingFace

Overall Evaluation Results

Comparison of downstream task performance across all evaluated models.

For each column, the highest score among the Turkish models (excluding the multilingual mmBERT) is shown in bold. The evaluation metric used for each task type is displayed in the column headers.

| Model | # of params (M) | Text Clf (F1) | Token Clf (F1) | STS (Pearson) | NLI (F1) | QA (F1) | Academic (F1) | Retrieval (NDCG@10) | Code Retrieval (NDCG@10) | Total Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| TurkishBERTweet | 163 | 79.71 | 92.02 | 75.86 | 79.10 | 38.13 | 63.12 | 68.40 | 43.49 | 67.48 |
| YTU-BERT | 111 | **84.25** | 93.60 | 84.68 | 84.16 | 31.50 | 71.78 | 74.29 | 53.80 | 72.26 |
| BERTurk | 110 | 83.42 | **93.67** | **85.33** | 84.33 | 60.16 | 71.40 | 74.84 | 54.54 | 75.96 |
| TabiBERT | 149 | 83.44 | 93.42 | 84.74 | **84.51** | **69.71** | **72.44** | **75.44** | **56.95** | **77.58** |
| mmBERT | 307 | 82.54 | 93.81 | 87.05 | 84.38 | 71.47 | 72.65 | 76.20 | 66.02 | 79.26 |

Evaluation Methodology

Systematic hyperparameter tuning was performed for all model-task pairs with the following search space:

| Parameter | Values |
|---|---|
| Learning Rate | 5e-6, 1e-5, 2e-5, 3e-5 |
| Weight Decay | 1e-5, 1e-6 |
| Batch Size | 16, 32 |
| Epochs | Up to 10, with early stopping |

For each task category, a single score is reported by computing a weighted average across all datasets, where each dataset's weight is proportional to its test set size. This ensures that larger, more representative datasets have corresponding influence on overall results (test set sizes range from 150 to 35,000 examples).
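
For clarity, the per-category aggregation can be written as a short helper; the dataset scores and test set sizes below are hypothetical:

def weighted_category_score(scores, test_set_sizes):
    # Weighted average where each dataset's weight is its share of the test examples.
    total = sum(test_set_sizes)
    return sum(score * size for score, size in zip(scores, test_set_sizes)) / total

# Hypothetical category with three datasets
print(weighted_category_score([70.0, 65.0, 72.0], [5_000, 150, 2_000]))   # ≈ 70.45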

Comparison with Baselines

Monolingual Turkish Models

TabiBERT was compared against three established Turkish BERT models:

  • BERTurk: Widely used Turkish monolingual encoder, pre-trained on web and special corpora
  • YTU-BERT: Uncased Turkish BERT, pre-trained on a large Turkish corpus (web, Wikipedia, books)
  • TurkishBERTweet: Uncased Turkish BERT, pre-trained at large scale on social media data

Result: TabiBERT outperforms all monolingual Turkish models with a total average score of 77.58, surpassing BERTurk (previous best) by 1.62 points.

Multilingual Comparison

TabiBERT was also compared against mmBERT, a multilingual ModernBERT-based encoder:

  • mmBERT: 307M parameters, pre-trained on 1,800 languages for 3T tokens
  • TabiBERT: 149M parameters, pre-trained on a predominantly Turkish corpus for 1T tokens

Result: While mmBERT achieves a higher total average score (79.26), TabiBERT offers advantages for Turkish-focused applications:

  • Lower memory requirements: Smaller model size (149M vs 307M parameters)
  • Better tokenization efficiency: 29% more effective context length for Turkish texts (see the sketch after this list)
  • Optimized for Turkish: Specialized monolingual model vs. general multilingual model
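
As a quick way to inspect tokenization efficiency on your own Turkish text, the snippet below compares token counts against bert-base-multilingual-cased, used here only as an illustrative multilingual baseline; this is not how the 29% figure was computed:

from transformers import AutoTokenizer

# "University students completed their internship applications during the semester break."
text = "Üniversite öğrencileri yarıyıl tatilinde staj başvurularını tamamladılar."

for model_id in ["boun-tabilab/TabiBERT", "bert-base-multilingual-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    num_tokens = len(tokenizer(text)["input_ids"])
    print(f"{model_id}: {num_tokens} tokens")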

Reproducibility

All evaluation datasets are publicly available on HuggingFace under the TabiBench collection to facilitate future research and comparisons.


Limitations

  • TabiBERT was trained mainly on Turkish, with additional English, code, and math data. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
  • As with any large-scale model, it may inherit biases present in its training data.
  • While the model supports sequences of up to 8,192 tokens, inference on very long inputs may be slower.
  • The model is still under evaluation; we recommend validating results before deployment in critical applications.

License

Released under the Apache 2.0 license.

Citation

Citation is in progress.
