Model Card for Banglish to Bengali Transliteration using mBART
This model performs transliteration from Banglish (Romanized Bengali) to Bengali script. It was created by fine-tuning facebook/mbart-large-50-many-to-many-mmt on the SKNahin/bengali-transliteration-data dataset.
The notebook used for training can be found here: Kaggle Notebook.
Model Details
Model Description
- Developed by: Shadab Tanjeed
- Model type: Sequence-to-sequence (Seq2Seq) Transformer model
- Language(s) (NLP): Bengali, Banglish (Romanized Bengali)
- Finetuned from model: facebook/mbart-large-50-many-to-many-mmt
Model Sources
- Base model repository: facebook/mbart-large-50-many-to-many-mmt (https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)
Uses
Direct Use
The model is intended for direct transliteration of Banglish text to Bengali script.
Downstream Use
It can be integrated into NLP applications where transliteration from Banglish to Bengali is required, such as chatbots, text normalization, and digital content processing.
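As one illustration of such an integration (a sketch, not a shipped API; the helper name and batching choices here are assumptions), the model can be wrapped in a small batch helper and called as a normalization step:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

MODEL_ID = "shadabtanjeed/mbart-banglish-to-bengali-transliteration"
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_ID)
model = MBartForConditionalGeneration.from_pretrained(MODEL_ID)

def transliterate_batch(texts):
    """Transliterate a list of Banglish strings to Bengali script."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    tokens = model.generate(
        **inputs,
        # Assumption: the fine-tune keeps mBART-50's language-code tokens
        forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],
    )
    return tokenizer.batch_decode(tokens, skip_special_tokens=True)

print(transliterate_batch(["kemon acho", "ami bhalo achi"]))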
Out-of-Scope Use
The model is not designed for language translation beyond transliteration, and it may not perform well on text containing mixed languages or code-switching.
Bias, Risks, and Limitations
- The model may struggle with ambiguous words that have multiple possible transliterations.
- It may not perform well on informal or highly stylized text.
- Limited dataset coverage could lead to errors in transliterating uncommon words.
Recommendations
Users should validate outputs, especially for critical applications, and consider further fine-tuning if necessary.
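One lightweight validation heuristic (a sketch, not part of the model) is to check that outputs actually land in the Bengali Unicode block before accepting them:

import re

# Bengali Unicode block (U+0980–U+09FF), danda marks, whitespace, and basic punctuation
BENGALI_ONLY = re.compile(r"^[\u0980-\u09FF\u0964\u0965\s.,!?;:()-]*$")

def looks_like_bengali(text: str) -> bool:
    """Heuristic sanity check: a valid transliteration should stay in Bengali script."""
    return bool(text.strip()) and BENGALI_ONLY.match(text) is not None

print(looks_like_bengali("আমি তোমাকে ভালোবাসি"))   # True
print(looks_like_bengali("ami tomake bhalobashi"))  # False (still Romanized)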
How to Get Started with the Model
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned checkpoint (not the base model) so the learned weights are used
model_name = "shadabtanjeed/mbart-banglish-to-bengali-transliteration"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

text = "ami tomake bhalobashi"
inputs = tokenizer(text, return_tensors="pt")

# Force Bengali as the output language; assumes the fine-tune keeps mBART-50's language codes
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],
)
output = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(output)  # Expected: the sentence rendered in Bengali script
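Note: the example above assumes the fine-tuned checkpoint retains mBART-50's language-code tokens, so Bengali output is forced via forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"]. If the training notebook configured the source and target languages differently, adjust these settings to match.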
Training Details
Training Data
The dataset used for training is SKNahin/bengali-transliteration-data, which contains pairs of Banglish (Romanized Bengali) and corresponding Bengali script.
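The dataset can be loaded directly from the Hugging Face Hub; inspect the printed splits and column names rather than assuming a schema:

from datasets import load_dataset

ds = load_dataset("SKNahin/bengali-transliteration-data")
print(ds)  # shows available splits and column names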
Training Procedure
Preprocessing
- Tokenization was performed using the mBART tokenizer (see the sketch after this list).
- Text normalization techniques were applied to remove noise.
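A minimal sketch of the tokenization step, assuming the dataset exposes parallel Banglish/Bengali text columns (the column names below are placeholders, not the dataset's actual schema):

from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

def preprocess(example):
    # `text_target` tokenizes the Bengali side with target-language settings
    return tokenizer(
        example["banglish"],             # placeholder column name
        text_target=example["bengali"],  # placeholder column name
        truncation=True,
        max_length=128,
    )

# tokenized = ds.map(preprocess, remove_columns=ds["train"].column_names)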
Training Hyperparameters
- Batch size: 8
- Learning rate: 3e-5
- Epochs: 5
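A hedged sketch of how these hyperparameters could map onto Seq2SeqTrainingArguments with the Hugging Face Trainer; the actual Kaggle notebook may differ, and `tokenized` refers to the preprocessed dataset from the sketch above:

from transformers import (
    DataCollatorForSeq2Seq,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

args = Seq2SeqTrainingArguments(
    output_dir="mbart-banglish-to-bengali",
    per_device_train_batch_size=8,  # batch size reported above
    learning_rate=3e-5,             # learning rate reported above
    num_train_epochs=5,             # epochs reported above
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],  # placeholder: output of the preprocessing step
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # tokenizer from the preprocessing sketch
)
trainer.train()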
Evaluation
Testing Data, Factors & Metrics
Testing Data
- The same dataset, SKNahin/bengali-transliteration-data, was used for evaluation (see the metric sketch below).
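The card does not report specific metrics. As one hedged example, a sentence-level exact-match score over held-out pairs could be computed as follows:

def exact_match(predictions, references):
    """Fraction of model outputs that match the reference Bengali string exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)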
Technical Specifications
Model Architecture and Objective
The model follows the Transformer-based Seq2Seq architecture from mBART.
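Key architecture dimensions can be read from the checkpoint's config:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
print(config.encoder_layers, config.decoder_layers, config.d_model)  # 12 encoder layers, 12 decoder layers, hidden size 1024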
Software
- Framework: Hugging Face Transformers
Citation
If you use this model, please cite the dataset and base model:
@misc{SKNahin2023,
  author    = {SK Nahin},
  title     = {Bengali Transliteration Dataset},
  year      = {2023},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/SKNahin/bengali-transliteration-data}
}
@article{liu2020multilingual,
  title   = {Multilingual Denoising Pre-training for Neural Machine Translation},
  author  = {Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2001.08210},
  year    = {2020}
}