Model Card for Banglish to Bengali Transliteration using mBART

This model transliterates Banglish (Romanized Bengali) into Bengali script. It was obtained by fine-tuning facebook/mbart-large-50-many-to-many-mmt on the SKNahin/bengali-transliteration-data dataset.

The notebook used for training can be found here: Kaggle Notebook.

Model Details

Model Description

  • Developed by: Shadab Tanjeed
  • Model type: Sequence-to-sequence (Seq2Seq) Transformer model
  • Language(s) (NLP): Bengali, Banglish (Romanized Bengali)
  • Finetuned from model: facebook/mbart-large-50-many-to-many-mmt

Uses

Direct Use

The model is intended for direct transliteration of Banglish text to Bengali script.

Downstream Use

It can be integrated into NLP applications where transliteration from Banglish to Bengali is required, such as chatbots, text normalization, and digital content processing.

Out-of-Scope Use

The model is not designed for language translation beyond transliteration, and it may not perform well on text containing mixed languages or code-switching.

Bias, Risks, and Limitations

  • The model may struggle with ambiguous words that have multiple possible transliterations.
  • It may not perform well on informal or highly stylized text.
  • Limited dataset coverage could lead to errors in transliterating uncommon words.

Recommendations

Users should validate outputs, especially for critical applications, and consider further fine-tuning if necessary.

How to Get Started with the Model

from transformers import AutoTokenizer, MBartForConditionalGeneration

# Load the fine-tuned checkpoint rather than the base mBART-50 model
model_name = "shadabtanjeed/mbart-banglish-to-bengali-transliteration"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

text = "ami tomake bhalobashi"
inputs = tokenizer(text, return_tensors="pt")

# If the output comes back in Latin script, additionally pass
# forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"] to generate()
translated_tokens = model.generate(**inputs)
output = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

print(output)  # Expected Bengali transliteration, e.g. "আমি তোমাকে ভালোবাসি"

Training Details

Training Data

The dataset used for training is SKNahin/bengali-transliteration-data, which contains pairs of Banglish (Romanized Bengali) and corresponding Bengali script.
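
The dataset can be loaded directly from the Hugging Face Hub with the datasets library. A minimal sketch follows; the split and column names are whatever the dataset card defines, so print the dataset object first to check them:

from datasets import load_dataset

# Load the Banglish/Bengali sentence pairs from the Hub
dataset = load_dataset("SKNahin/bengali-transliteration-data")

print(dataset)              # shows the available splits and column names
print(dataset["train"][0])  # first pair, assuming a "train" split exists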

Training Procedure

Preprocessing

  • Tokenization was performed using the mBART tokenizer.
  • Text normalization techniques were applied to remove noise.
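
A rough sketch of how these two steps could be wired together, assuming the dataset and tokenizer loaded in the earlier snippets. The column names "banglish" and "bengali" and the normalization rules are placeholders, not the exact logic from the training notebook:

import re

def normalize(text):
    # Illustrative noise removal: lowercase and collapse repeated whitespace
    return re.sub(r"\s+", " ", text.strip().lower())

def preprocess(batch):
    # "banglish" / "bengali" are placeholder column names
    inputs = [normalize(t) for t in batch["banglish"]]
    targets = [normalize(t) for t in batch["bengali"]]
    # Tokenize inputs and targets with the mBART tokenizer
    return tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

tokenized = dataset.map(preprocess, batched=True)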

Training Hyperparameters

  • Batch size: 8
  • Learning rate: 3e-5
  • Epochs: 5
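
These settings correspond roughly to the following Seq2SeqTrainingArguments. This is a sketch under the assumption that the Hugging Face Trainer API was used; the output directory is a placeholder, and model, tokenizer, and tokenized refer to the earlier snippets:

from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart-banglish-to-bengali",  # placeholder output directory
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # assumes a "train" split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()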

Technical Specifications

Model Architecture and Objective

The model follows the Transformer-based Seq2Seq architecture from mBART.
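
The relevant dimensions can be read off the loaded checkpoint's configuration, for example:

# Inspect the Seq2Seq Transformer configuration of the loaded mBART model
config = model.config
print(config.encoder_layers, config.decoder_layers)  # layers in encoder / decoder
print(config.d_model)                                # hidden (embedding) size
print(config.vocab_size)                             # shared multilingual vocabulary size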

Software

  • Framework: Hugging Face Transformers

Citation

If you use this model, please cite the dataset and base model:

@misc{SKNahin2023,
  author = {SK Nahin},
  title = {Bengali Transliteration Dataset},
  year = {2023},
  publisher = {Hugging Face Datasets},
  url = {https://huggingface.co/datasets/SKNahin/bengali-transliteration-data}
}

@article{liu2020mbart,
  title = {Multilingual Denoising Pre-training for Neural Machine Translation},
  author = {Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2001.08210},
  year = {2020}
}