|
--- |
|
language: |
|
- en |
|
- th |
|
tags: |
|
- transliteration |
|
- thai |
|
- english |
|
- byt5 |
|
- fp16 |
|
license: mit |
|
datasets: |
|
- custom |
|
library_name: transformers |
|
pipeline_tag: text2text-generation |
|
--- |
|
|
|
# English to Thai Transliteration Model |
|
|
|
This model transliterates English text to Thai script, converting the sound of English words into Thai characters. |
|
|
|
## Model Description |
|
|
|
This model is based on ByT5, a token-free sequence-to-sequence model that operates directly on UTF-8 bytes. It has been fine-tuned specifically for English to Thai transliteration using multiple data sources to improve accuracy and coverage. |
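As a quick illustration of what "token-free" means here, a minimal sketch using the base tokenizer (ByT5 maps each UTF-8 byte to ID `byte value + 3`, since IDs 0-2 are reserved for special tokens):

```python
from transformers import AutoTokenizer

# ByT5 tokenizes raw UTF-8 bytes: each byte becomes ID byte_value + 3,
# because IDs 0-2 are reserved for special tokens (pad, eos, unk).
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
print(tokenizer("hi").input_ids)  # [107, 108, 1]: ord('h') + 3, ord('i') + 3, </s>
```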
|
|
|
- **Developed by**: yacht |
|
- **Model type**: ByT5 (Sequence-to-Sequence) |
|
- **Language(s)**: English → Thai |
|
- **License**: MIT (free for commercial use) |
|
- **FP16 Support**: Yes (model supports half-precision inference) |
|
|
|
## Intended Uses & Limitations |
|
|
|
### Intended Uses |
|
- Converting English names, places, and terms into Thai script |
|
- Assisting with the transliteration of foreign words into Thai |
|
- Educational purposes for learning Thai script |
|
- Improving accessibility of English content for Thai speakers |
|
|
|
### Limitations |
|
- The model may struggle with uncommon or complex English words |
|
- Transliteration quality depends on the training data coverage |
|
- The model focuses on phonetic conversion, not translation |
|
|
|
## Training and Evaluation |
|
|
|
### Training Data |
|
The model was trained on a combined dataset of English-Thai transliteration pairs from multiple sources. The dataset includes: |
|
- Common English words and their Thai transliterations |
|
- Names of people, places, and organizations |
|
- Technical terms and other domain-specific vocabulary |
|
- Geological and scientific terminology |
|
|
|
### Training Procedure |
|
- **Training framework**: Hugging Face Transformers |
|
- **Base model**: `google/byt5-base` |
|
- **Training hyperparameters** (see the configuration sketch after this list):

  - Learning rate: `2e-4`

  - Batch size: `8`

  - Number of epochs: `10`

  - Optimizer: `AdamW`

  - Mixed precision: FP16

  - Gradient clipping: `max_grad_norm=1.0`
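These settings map roughly onto `Seq2SeqTrainingArguments` as follows. This is a hypothetical reconstruction, not the actual training script; `output_dir` and anything not listed above (scheduler, warmup, evaluation cadence) are placeholders:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-en2th",          # placeholder path
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    num_train_epochs=10,
    optim="adamw_torch",              # AdamW optimizer
    fp16=True,                        # mixed-precision training
    max_grad_norm=1.0,                # gradient clipping
)
```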
|
|
|
### Evaluation Results |
|
- **Accuracy**: `0.7831` |
|
- **Character Error Rate**: `0.0591` |
|
- **Mean Levenshtein Distance**: `0.4654` |
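For reference, the character error rate above is edit distance normalized by reference length. A minimal, self-contained sketch (not necessarily the exact evaluation script used):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(pred: str, ref: str) -> float:
    return levenshtein(pred, ref) / max(len(ref), 1)

print(char_error_rate("กรูป", "กรุ๊ป"))  # one of the error cases listed below
```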
|
|
|
## How to Use |
|
|
|
### Standard Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
# Load model and tokenizer |
|
model_name = "yacht/byt5-base-en2th-transliterator" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) |
|
|
|
# Transliterate English to Thai |
|
english_text = "hello" |
|
inputs = tokenizer(english_text, return_tensors="pt") |
|
outputs = model.generate(inputs.input_ids) |
|
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(f"English: {english_text} → Thai: {thai_text}") |
|
``` |
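The same call can also be made through the `text2text-generation` pipeline:

```python
from transformers import pipeline

# Equivalent one-liner via the text2text-generation pipeline.
transliterate = pipeline("text2text-generation", model="yacht/byt5-base-en2th-transliterator")
print(transliterate("hello")[0]["generated_text"])
```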
|
|
|
### Using with FP16 for Faster Inference |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
# Load the model in fp16 for faster inference on a CUDA GPU; fall back to fp32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Transliterate English to Thai; inputs must live on the same device as the model
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt").to(device)
outputs = model.generate(inputs.input_ids)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")
|
``` |
|
|
|
## Examples |
|
|
|
| English | Thai | |
|
|----------|-----------| |
|
| hello | เฮลโล | |
|
| computer | คอมพิวเตอร์ | |
|
| thailand | ไทยแลนด์ | |
|
| bangkok | แบงค็อก | |
|
| graph | กราฟ | |
|
| grossular | กรอสซูลาร์ | |
|
| grossularite | กรอสซูลาไรต์ | |
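The table above can be reproduced with a single batched call. A usage sketch; actual outputs depend on the released checkpoint:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Pad to the longest word in the batch and decode all outputs at once.
words = ["hello", "computer", "thailand", "bangkok", "graph"]
inputs = tokenizer(words, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```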
|
|
|
## Performance Benefits of FP16 |
|
|
|
Using FP16 (half precision) can provide significant performance benefits (see the memory estimate after this list):
|
- Up to 2x faster inference on compatible GPUs |
|
- Reduced memory usage (approximately half compared to FP32) |
|
- Minimal impact on transliteration quality |
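A back-of-the-envelope check of the memory claim, counting 4 bytes per FP32 parameter versus 2 bytes per FP16 parameter:

```python
from transformers import AutoModelForSeq2SeqLM

# Weights-only estimate; activations and decoding caches add more at runtime.
model = AutoModelForSeq2SeqLM.from_pretrained("yacht/byt5-base-en2th-transliterator")
n_params = sum(p.numel() for p in model.parameters())
print(f"~{n_params * 4 / 1e6:.0f} MB in FP32 vs. ~{n_params * 2 / 1e6:.0f} MB in FP16")
```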
|
|
|
## Multi-Dataset Training |
|
|
|
This model was trained on several datasets combined into a single training set (see the sketch after this list), which provides several advantages:
|
- Broader vocabulary coverage across different domains |
|
- Improved handling of edge cases and uncommon words |
|
- More consistent transliteration patterns |
|
- Better generalization to new, unseen words |
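The source datasets are not published, so the names and contents below are placeholders, but combining pair lists of this shape might look like:

```python
from datasets import Dataset, concatenate_datasets

# Placeholder pair lists standing in for the unpublished source datasets.
common_words = Dataset.from_dict({"en": ["hello"], "th": ["เฮลโล"]})
proper_names = Dataset.from_dict({"en": ["bangkok"], "th": ["แบงค็อก"]})
combined = concatenate_datasets([common_words, proper_names]).shuffle(seed=42)
```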
|
|
|
## Limitations and Bias |
|
|
|
This model is designed specifically for transliteration, not translation. It attempts to convert the sounds of English words into Thai script, not to provide their Thai translations. |
|
|
|
The model's performance may vary based on: |
|
- The phonetic complexity of the input |
|
- Whether the input contains sounds that are difficult to represent in Thai |
|
- The coverage of similar words in the training data |
|
|
|
## Common Errors |
|
|
|
Some common error patterns observed (a workaround sketch follows the list):
|
- group → กรูป (should be: กรุ๊ป) |
|
- golf → โกล์ฟ (should be: กอล์ฟ) |
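Known-bad outputs like these can be patched with a small exception dictionary consulted before the model runs. This is a pragmatic workaround sketch, not part of the model itself:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Exception table for the known error cases above; extend as more are found.
OVERRIDES = {"group": "กรุ๊ป", "golf": "กอล์ฟ"}

def transliterate(word: str) -> str:
    fixed = OVERRIDES.get(word.lower())
    if fixed is not None:
        return fixed
    ids = tokenizer(word, return_tensors="pt").input_ids
    return tokenizer.decode(model.generate(ids)[0], skip_special_tokens=True)

print(transliterate("group"))  # กรุ๊ป, served from the override table
```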
|
|
|
## License |
|
|
|
MIT License |
|
|
|
Copyright (c) 2025 yacht |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
|
of this software and associated documentation files (the "Software"), to deal |
|
in the Software without restriction, including without limitation the rights |
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
|
copies of the Software, and to permit persons to whom the Software is |
|
furnished to do so, subject to the following conditions: |
|
|
|
The above copyright notice and this permission notice shall be included in all |
|
copies or substantial portions of the Software. |
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
|
SOFTWARE. |
|
|