🌐 T5-Based Multilingual Text Translator
This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes quantization for efficient inference and speech synthesis support for accessibility.
📝 Problem Statement
The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.
📊 Dataset
Source: Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.

Languages Supported:
- English
- French
- German
- Italian
- Portuguese
Structure:
- Each language has a corresponding `.txt` file.
- Lines are aligned by index to form translation pairs.
Example Input Format:
- Source: `translate English to French: I am a student.`
- Target: `Je suis un étudiant.`
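A minimal sketch of how aligned files like these can be turned into T5-style source/target pairs. The file names and the `load_pairs` helper are illustrative, not part of the repository:

```python
# Build "translate English to French" pairs from two line-aligned .txt files.
# Paths and this helper are illustrative; adjust them to your own corpus.
def load_pairs(src_path, tgt_path, src_lang="English", tgt_lang="French"):
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        src_lines = [line.strip() for line in f_src]
        tgt_lines = [line.strip() for line in f_tgt]
    assert len(src_lines) == len(tgt_lines), "files must be line-aligned"
    return [
        {"input": f"translate {src_lang} to {tgt_lang}: {s}", "target": t}
        for s, t in zip(src_lines, tgt_lines)
        if s and t
    ]

pairs = load_pairs("english.txt", "french.txt")
print(pairs[0])  # {'input': 'translate English to French: ...', 'target': '...'}
```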
🧠 Model Details
- Architecture: T5-small
- Tokenizer: `T5Tokenizer`
- Model: `T5ForConditionalGeneration`
- Task Type: Sequence-to-Sequence Translation (Supervised Fine-tuning)
🔧 Installation
```bash
pip install transformers datasets torch gtts
```
🚀 Loading the Model
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load quantized model (FP16)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs)
print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
📈 Performance Metrics
Because this project is based on single-epoch fine-tuning, performance metrics were not explicitly computed. For a production-level system, BLEU or ROUGE scores should be evaluated on a held-out test set.
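If you want to compute BLEU yourself, a minimal sketch using the `sacrebleu` package (an extra dependency, not listed in the installation step above) might look like this:

```python
# pip install sacrebleu  (extra dependency, not part of the setup above)
import sacrebleu

# Hypothetical model outputs and aligned reference translations from the test split.
hypotheses = ["Je suis un étudiant.", "Comment allez-vous ?"]
references = [["Je suis un étudiant.", "Comment vas-tu ?"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```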
🏋️ Fine-Tuning Details
📚 Dataset Preparation
- A total of 5 text files (`english.txt`, `french.txt`, etc.)
- Each sentence aligned by index for parallel translation.
🔧 Training Configuration
- Epochs: 1
- Batch size: 4
- Max sequence length: 128
- Model base: `t5-small`
- Framework: Hugging Face Transformers + PyTorch
- Evaluation strategy: 10% test split
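A minimal fine-tuning sketch following these settings, written as a plain PyTorch loop. The `pairs` list is assumed to come from the dataset-preparation helper shown earlier, and the learning rate and optimizer are illustrative choices, not necessarily the exact values used for this model:

```python
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# `pairs` is the list of {"input": ..., "target": ...} dicts built earlier (illustrative).
split = int(0.9 * len(pairs))  # 10% held out for evaluation
train_pairs, test_pairs = pairs[:split], pairs[split:]

def collate(batch):
    inputs = tokenizer([p["input"] for p in batch], max_length=128,
                       padding=True, truncation=True, return_tensors="pt")
    targets = tokenizer([p["target"] for p in batch], max_length=128,
                        padding=True, truncation=True, return_tensors="pt")
    labels = targets.input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    inputs["labels"] = labels
    return inputs

loader = DataLoader(train_pairs, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(1):  # single epoch, as in the configuration above
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("model")
tokenizer.save_pretrained("model")
```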
🔄 Quantization
Post-training quantization was performed by converting the model to half precision (FP16) with `.half()`, reducing model size and improving inference speed.
```python
from transformers import T5ForConditionalGeneration

# Load full-precision (FP32) model
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")

# Convert to half precision (FP16) and save
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
```
Model Size Comparison:

| Type | Size (KB) |
|---|---|
| FP32 (Original) | ~6,904 |
| FP16 (Quantized) | ~3,452 |
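To reproduce such a comparison, you can sum the file sizes in each saved directory (a small sketch; the directory names follow the repository layout below):

```python
from pathlib import Path

def dir_size_kb(path):
    """Total size of all files under `path`, in KB."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1024

print(f"FP32 model: {dir_size_kb('model'):,.0f} KB")
print(f"FP16 model: {dir_size_kb('quantized_model'):,.0f} KB")
```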
📁 Repository Structure
```
.
├── model/                        # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/              # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                     # Documentation
└── multilingual_translator.py    # Training and inference script
```
⚠️ Limitations
- Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
- Language coverage is limited to 5 predefined languages.
- gTTS depends on Google's online TTS service and requires internet access.
🤝 Contributing
Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.