|
--- |
|
language: |
|
- en |
|
- th |
|
tags: |
|
- transliteration |
|
- thai |
|
- english |
|
- byt5 |
|
- fp16 |
|
license: mit |
|
datasets: |
|
- custom |
|
library_name: transformers |
|
pipeline_tag: text2text-generation |
|
--- |
|
|
|
# English to Thai Transliteration Model |
|
|
|
This model transliterates English text to Thai script, converting the sound of English words into Thai characters. |
|
|
|
## Model Description |
|
|
|
This model is based on ByT5, a token-free sequence-to-sequence model that operates directly on UTF-8 bytes. It has been fine-tuned specifically for English to Thai transliteration using multiple data sources to improve accuracy and coverage. |
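As a quick illustration of what "token-free" means here, a minimal sketch using the base tokenizer (ByT5 maps each UTF-8 byte to ID `byte value + 3`, since IDs 0-2 are reserved for special tokens):

```python
from transformers import AutoTokenizer

# ByT5 tokenizes raw UTF-8 bytes: each byte becomes ID byte_value + 3,
# because IDs 0-2 are reserved for special tokens (pad, eos, unk).
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
print(tokenizer("hi").input_ids)  # [107, 108, 1]: ord('h') + 3, ord('i') + 3, </s>
```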
|
|
|
- **Developed by**: yacht |
|
- **Model type**: ByT5 (Sequence-to-Sequence) |
|
- **Language(s)**: English → Thai |
|
- **License**: MIT (free for commercial use) |
|
- **FP16 Support**: Yes (model supports half-precision inference) |
|
|
|
## Intended Uses & Limitations |
|
|
|
### Intended Uses |
|
- Converting English names, places, and terms into Thai script |
|
- Assisting with the transliteration of foreign words into Thai |
|
- Educational purposes for learning Thai script |
|
- Improving accessibility of English content for Thai speakers |
|
|
|
### Limitations |
|
- The model may struggle with uncommon or complex English words |
|
- Transliteration quality depends on the training data coverage |
|
- The model focuses on phonetic conversion, not translation |
|
|
|
## Training and Evaluation |
|
|
|
### Training Data |
|
The model was trained on a combined dataset of English-Thai transliteration pairs from multiple sources. The dataset includes: |
|
- Common English words and their Thai transliterations |
|
- Names of people, places, and organizations |
|
- Technical terms and other domain-specific vocabulary |
|
- Geological and scientific terminology |
|
|
|
### Training Procedure |
|
- **Training framework**: Hugging Face Transformers |
|
- **Base model**: `google/byt5-base` |
|
- **Training hyperparameters** (see the configuration sketch after this list):

  - Learning rate: `2e-4`

  - Batch size: `8`

  - Number of epochs: `10`

  - Optimizer: `AdamW`

  - Mixed precision: FP16

  - Gradient clipping: `max_grad_norm=1.0`
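These settings map roughly onto `Seq2SeqTrainingArguments` as follows. This is a hypothetical reconstruction, not the actual training script; `output_dir` and anything not listed above (scheduler, warmup, evaluation cadence) are placeholders:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-en2th",          # placeholder path
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    num_train_epochs=10,
    optim="adamw_torch",              # AdamW optimizer
    fp16=True,                        # mixed-precision training
    max_grad_norm=1.0,                # gradient clipping
)
```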
|
|
|
### Evaluation Results |
|
- **Accuracy**: `0.7831` |
|
- **Character Error Rate**: `0.0591` |
|
- **Mean Levenshtein Distance**: `0.4654` |
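For reference, the character error rate above is edit distance normalized by reference length. A minimal, self-contained sketch (not necessarily the exact evaluation script used):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(pred: str, ref: str) -> float:
    return levenshtein(pred, ref) / max(len(ref), 1)

print(char_error_rate("กรูป", "กรุ๊ป"))  # one of the error cases listed below
```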
|
|
|
## How to Use |
|
|
|
### Standard Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
# Load model and tokenizer |
|
model_name = "yacht/byt5-base-en2th-transliterator" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) |
|
|
|
# Transliterate English to Thai |
|
english_text = "hello" |
|
inputs = tokenizer(english_text, return_tensors="pt") |
|
outputs = model.generate(inputs.input_ids) |
|
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(f"English: {english_text} → Thai: {thai_text}") |
|
``` |
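The same call can also be made through the `text2text-generation` pipeline:

```python
from transformers import pipeline

# Equivalent one-liner via the text2text-generation pipeline.
transliterate = pipeline("text2text-generation", model="yacht/byt5-base-en2th-transliterator")
print(transliterate("hello")[0]["generated_text"])
```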
|
|
|
### Using with FP16 for Faster Inference |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
# Load the model in fp16 for faster inference on a CUDA GPU; fall back to fp32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Transliterate English to Thai; inputs must live on the same device as the model
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt").to(device)
outputs = model.generate(inputs.input_ids)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")
|
``` |
|
|
|
## Examples |
|
|
|
| English | Thai | |
|
|----------|-----------| |
|
| hello | เฮลโล | |
|
| computer | คอมพิวเตอร์ | |
|
| thailand | ไทยแลนด์ | |
|
| bangkok | แบงค็อก | |
|
| graph | กราฟ | |
|
| grossular | กรอสซูลาร์ | |
|
| grossularite | กรอสซูลาไรต์ | |
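The table above can be reproduced with a single batched call. A usage sketch; actual outputs depend on the released checkpoint:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Pad to the longest word in the batch and decode all outputs at once.
words = ["hello", "computer", "thailand", "bangkok", "graph"]
inputs = tokenizer(words, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```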
|
|
|
## Performance Benefits of FP16 |
|
|
|
Using FP16 (half precision) can provide significant performance benefits (see the memory estimate after this list):
|
- Up to 2x faster inference on compatible GPUs |
|
- Reduced memory usage (approximately half compared to FP32) |
|
- Minimal impact on transliteration quality |
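A back-of-the-envelope check of the memory claim, counting 4 bytes per FP32 parameter versus 2 bytes per FP16 parameter:

```python
from transformers import AutoModelForSeq2SeqLM

# Weights-only estimate; activations and decoding caches add more at runtime.
model = AutoModelForSeq2SeqLM.from_pretrained("yacht/byt5-base-en2th-transliterator")
n_params = sum(p.numel() for p in model.parameters())
print(f"~{n_params * 4 / 1e6:.0f} MB in FP32 vs. ~{n_params * 2 / 1e6:.0f} MB in FP16")
```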
|
|
|
## Multi-Dataset Training |
|
|
|
This model was trained on several datasets combined into a single training set (see the sketch after this list), which provides several advantages:
|
- Broader vocabulary coverage across different domains |
|
- Improved handling of edge cases and uncommon words |
|
- More consistent transliteration patterns |
|
- Better generalization to new, unseen words |
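The source datasets are not published, so the names and contents below are placeholders, but combining pair lists of this shape might look like:

```python
from datasets import Dataset, concatenate_datasets

# Placeholder pair lists standing in for the unpublished source datasets.
common_words = Dataset.from_dict({"en": ["hello"], "th": ["เฮลโล"]})
proper_names = Dataset.from_dict({"en": ["bangkok"], "th": ["แบงค็อก"]})
combined = concatenate_datasets([common_words, proper_names]).shuffle(seed=42)
```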
|
|
|
## Limitations and Bias |
|
|
|
This model is designed specifically for transliteration, not translation. It attempts to convert the sounds of English words into Thai script, not to provide their Thai translations. |
|
|
|
The model's performance may vary based on: |
|
- The phonetic complexity of the input |
|
- Whether the input contains sounds that are difficult to represent in Thai |
|
- The coverage of similar words in the training data |
|
|
|
## Common Errors |
|
|
|
Some common error patterns observed (a workaround sketch follows the list):
|
- group → กรูป (should be: กรุ๊ป) |
|
- golf → โกล์ฟ (should be: กอล์ฟ) |
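Known-bad outputs like these can be patched with a small exception dictionary consulted before the model runs. This is a pragmatic workaround sketch, not part of the model itself:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Exception table for the known error cases above; extend as more are found.
OVERRIDES = {"group": "กรุ๊ป", "golf": "กอล์ฟ"}

def transliterate(word: str) -> str:
    fixed = OVERRIDES.get(word.lower())
    if fixed is not None:
        return fixed
    ids = tokenizer(word, return_tensors="pt").input_ids
    return tokenizer.decode(model.generate(ids)[0], skip_special_tokens=True)

print(transliterate("group"))  # กรุ๊ป, served from the override table
```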
|
|
|
## License |
|
|
|
MIT License |
|
|
|
Copyright (c) 2025 yacht |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
|
of this software and associated documentation files (the "Software"), to deal |
|
in the Software without restriction, including without limitation the rights |
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
|
copies of the Software, and to permit persons to whom the Software is |
|
furnished to do so, subject to the following conditions: |
|
|
|
The above copyright notice and this permission notice shall be included in all |
|
copies or substantial portions of the Software. |
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
|
SOFTWARE. |
|
|