---
language:
- en
- th
tags:
- transliteration
- thai
- english
- byt5
- fp16
license: mit
datasets:
- custom
library_name: transformers
pipeline_tag: text2text-generation
---
# English to Thai Transliteration Model
This model transliterates English text to Thai script, converting the sound of English words into Thai characters.
## Model Description
This model is based on ByT5, a token-free sequence-to-sequence model that operates directly on UTF-8 bytes. It has been fine-tuned specifically for English to Thai transliteration using multiple data sources to improve accuracy and coverage.
- **Developed by**: yacht
- **Model type**: ByT5 (Sequence-to-Sequence)
- **Language(s)**: English → Thai
- **License**: MIT (free for commercial use)
- **FP16 Support**: Yes (model supports half-precision inference)
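
Because ByT5 works on UTF-8 bytes rather than subword tokens, sequence length is measured in bytes. The sketch below mimics that mapping, assuming the usual ByT5 convention of three reserved IDs (pad, EOS, unknown) placed before the 256 byte values. The helper `byt5_style_ids` is hypothetical and only for illustration; the real tokenizer is loaded with `AutoTokenizer`.

```python
def byt5_style_ids(text: str, offset: int = 3, eos_id: int = 1) -> list[int]:
    """Map text to byte-level IDs the way ByT5 is assumed to: byte value + offset, then EOS."""
    return [byte + offset for byte in text.encode("utf-8")] + [eos_id]


for word in ["graph", "กราฟ"]:
    ids = byt5_style_ids(word)
    # Thai characters occupy 3 bytes each in UTF-8, so Thai text produces
    # about three IDs per character, while ASCII letters produce one.
    print(f"{word}: {len(word)} characters, {len(ids)} ids (including EOS)")
```

The practical consequence is that Thai outputs need roughly three generated tokens per character, which matters when choosing a generation length limit.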
## Intended Uses & Limitations
### Intended Uses
- Converting English names, places, and terms into Thai script
- Assisting with the transliteration of foreign words into Thai
- Educational purposes for learning Thai script
- Improving accessibility of English content for Thai speakers
### Limitations
- The model may struggle with uncommon or complex English words
- Transliteration quality depends on the training data coverage
- The model focuses on phonetic conversion, not translation
## Training and Evaluation
### Training Data
The model was trained on a combined dataset of English-Thai transliteration pairs from multiple sources. The dataset includes:
- Common English words and their Thai transliterations
- Names of people, places, and organizations
- Technical terms and other domain-specific vocabulary
- Geological and scientific terminology
### Training Procedure
- **Training framework**: Hugging Face Transformers
- **Base model**: `google/byt5-base`
- **Training hyperparameters**:
  - Learning rate: `2e-4`
  - Batch size: `8`
  - Number of epochs: `10`
  - Optimizer: `AdamW`
  - Mixed precision: FP16 (model was trained with mixed precision)
  - Gradient clipping: Yes (`max_grad_norm=1.0`)
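
As a rough illustration, the hyperparameters above could be expressed as keyword arguments for `Seq2SeqTrainingArguments`. This is a sketch, not the author's training script: treating the batch size as per-device, the `adamw_torch` optimizer variant, and the `output_dir` name are all assumptions.

```python
# Hyperparameters from this model card, keyed by Seq2SeqTrainingArguments names.
training_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 8,  # assumption: the card does not say per-device or global
    "num_train_epochs": 10,
    "optim": "adamw_torch",            # assumption: the card only says "AdamW"
    "fp16": True,                      # mixed-precision training, needs a CUDA GPU
    "max_grad_norm": 1.0,              # gradient clipping threshold
}

# On a GPU machine with transformers installed, these could be passed as:
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(output_dir="byt5-en2th", **training_kwargs)  # hypothetical output_dir
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```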
### Evaluation Results
- **Accuracy**: `0.7831`
- **Character Error Rate**: `0.0591`
- **Mean Levenshtein Distance**: `0.4654`
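
The card does not state how these metrics were computed. The sketch below shows one common set of definitions, all of which are assumptions: accuracy as the exact-match rate, character error rate as total edit distance divided by total reference length, and mean Levenshtein distance as the average edit distance per pair, with characters counted as Unicode code points. The sample pairs come from the Examples and Common Errors sections of this card.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def evaluate(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Exact-match accuracy, micro-averaged CER, and mean edit distance."""
    distances = [levenshtein(p, r) for p, r in zip(predictions, references)]
    return {
        "accuracy": sum(d == 0 for d in distances) / len(distances),
        "cer": sum(distances) / sum(len(r) for r in references),
        "mean_levenshtein": sum(distances) / len(distances),
    }


predictions = ["กราฟ", "กรูป", "โกล์ฟ"]  # model outputs listed in this card
references = ["กราฟ", "กรุ๊ป", "กอล์ฟ"]   # expected transliterations
print(evaluate(predictions, references))
```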
## How to Use
### Standard Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Transliterate English to Thai
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt")
# ByT5 emits one token per UTF-8 byte and each Thai character takes 3 bytes,
# so set an explicit output budget rather than relying on the default length.
outputs = model.generate(**inputs, max_new_tokens=128)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")
```
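
To transliterate many words at once, the same calls can be wrapped in a small batching helper. This is a minimal sketch: the function names, the default `batch_size`, and the `max_new_tokens` value are illustrative assumptions rather than settings recommended by the author.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transliterate_batch(texts, tokenizer, model, batch_size=16, max_new_tokens=128):
    """Transliterate a list of English strings, one padded batch at a time."""
    results = []
    for batch in chunked(list(texts), batch_size):
        # Padding lets inputs of different byte lengths share one tensor;
        # .to(model.device) keeps inputs on the same device as the model.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results


# Usage, with `tokenizer` and `model` loaded as in the example above:
#   words = ["hello", "computer", "graph"]
#   for word, thai in zip(words, transliterate_batch(words, tokenizer, model)):
#       print(word, "→", thai)
```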
### Using with FP16 for Faster Inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use fp16 on a CUDA GPU; fall back to fp32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move the model to the same device as the inputs to avoid a device mismatch
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=dtype).to(device)

# Transliterate English to Thai
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=128)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")
```
## Examples
| English | Thai |
|----------|-----------|
| hello | เฮลโล |
| computer | คอมพิวเตอร์ |
| thailand | ไทยแลนด์ |
| bangkok | แบงค็อก |
| graph | กราฟ |
| grossular | กรอสซูลาร์ |
| grossularite | กรอสซูลาไรต์ |
## Performance Benefits of FP16
Running the model in FP16 (half precision) on a CUDA GPU can provide significant performance benefits:
- Up to 2x faster inference on compatible GPUs; the actual speedup depends on the hardware and batch size
- Roughly half the weight memory of FP32 (2 bytes per parameter instead of 4)
- Minimal impact on transliteration quality
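
The memory saving follows directly from bytes per parameter. The sketch below estimates weight memory only (activations and framework overhead are excluded) and assumes roughly 580 million parameters for `google/byt5-base`; that figure is an approximation, so check the base model card for the exact count.

```python
def weight_memory_gib(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


NUM_PARAMETERS = 580_000_000  # assumption: approximate size of google/byt5-base

fp32_gib = weight_memory_gib(NUM_PARAMETERS, 4)  # float32 uses 4 bytes per value
fp16_gib = weight_memory_gib(NUM_PARAMETERS, 2)  # float16 uses 2 bytes per value
print(f"FP32 weights: {fp32_gib:.2f} GiB")
print(f"FP16 weights: {fp16_gib:.2f} GiB")
```

With a loaded model, the exact parameter count can be obtained with `sum(p.numel() for p in model.parameters())`.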
## Multi-Dataset Training
This model was trained on several datasets merged into one, which is intended to provide the following advantages:
- Broader vocabulary coverage across different domains
- Improved handling of edge cases and uncommon words
- More consistent transliteration patterns
- Better generalization to new, unseen words
## Limitations and Bias
This model is designed specifically for transliteration, not translation. It attempts to convert the sounds of English words into Thai script, not to provide their Thai translations.
The model's performance may vary based on:
- The phonetic complexity of the input
- Whether the input contains sounds that are difficult to represent in Thai
- The coverage of similar words in the training data
## Common Errors
Some common error patterns observed:
- group → กรูป (should be: กรุ๊ป)
- golf → โกล์ฟ (should be: กอล์ฟ)
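
Thai vowel signs and tone marks are separate code points, so errors like these can be hard to spot by eye. The sketch below uses only the Python standard library to name the characters that differ between a model output and the expected form. The helper `describe_diff` is hypothetical and not part of this model's tooling.

```python
import difflib
import unicodedata


def describe_diff(predicted: str, expected: str) -> list[str]:
    """Describe code-point level differences between two strings."""
    matcher = difflib.SequenceMatcher(a=predicted, b=expected, autojunk=False)
    notes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        got = [unicodedata.name(c, "UNKNOWN") for c in predicted[i1:i2]]
        want = [unicodedata.name(c, "UNKNOWN") for c in expected[j1:j2]]
        notes.append(f"{tag}: got {got}, expected {want}")
    return notes


# The two error patterns listed above: (model output, expected form)
for predicted, expected in [("กรูป", "กรุ๊ป"), ("โกล์ฟ", "กอล์ฟ")]:
    print(f"{predicted} vs {expected}")
    for note in describe_diff(predicted, expected):
        print("  " + note)
```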
## License
MIT License

Copyright (c) 2025 yacht

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.