Khmer mT5 Summarization Model
📌 Introduction
This repository contains a fine-tuned mT5 model for Khmer text summarization. The model is based on Google's mT5-small and fine-tuned on a dataset of Khmer text and corresponding summaries.
Fine-tuning was performed using the Hugging Face Trainer API, optimizing the model to generate concise and meaningful summaries of Khmer text.
🚀 Model Details
- Base Model: google/mt5-small
- Fine-tuned for: Khmer text summarization
- Training Dataset: kimleang123/khmer-text-dataset
- Framework: Hugging Face transformers
- Task Type: Sequence-to-Sequence (Seq2Seq)
- Input: Khmer text (articles, paragraphs, or documents)
- Output: Summarized Khmer text
- Training Hardware: GPU (Tesla T4)
- Evaluation Metric: ROUGE Score
🔧 Installation & Setup
1️⃣ Install Dependencies
Ensure you have transformers, torch, and datasets installed:
pip install transformers torch datasets
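Before loading the model, it can help to confirm that the packages are importable in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


required = ["transformers", "torch", "datasets"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```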
2️⃣ Load the Model
To load and use the fine-tuned model:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
📌 How to Use
1️⃣ Using Python Code
def summarize_khmer(text, max_length=150):
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
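The call to `generate` above uses beam search (`num_beams=5`) with `length_penalty=2.0`. As the Transformers documentation describes it, the length is raised to this exponent and used to divide a candidate's summed log-probability, so values above 0 favor longer outputs. The sketch below only illustrates that arithmetic. It is not the library's code, and the function name and the two candidate scores are made up for illustration.

```python
def length_normalized_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Summed log-probability divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical candidates: (summed log-probability, length in tokens).
short_candidate = (-4.0, 4)
long_candidate = (-6.0, 8)

for penalty in (0.0, 1.0, 2.0):
    short_score = length_normalized_score(*short_candidate, penalty)
    long_score = length_normalized_score(*long_candidate, penalty)
    # The candidate with the higher (less negative) score is preferred.
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.4f}, long={long_score:.4f} -> {winner}")
```

With no penalty the shorter candidate wins on raw log-probability. Raising the exponent shifts the preference toward the longer one, which is worth keeping in mind if summaries come out longer than expected.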
2️⃣ Using Hugging Face Pipeline
For a simpler approach:
from transformers import pipeline
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer Summary:", summary[0]['summary_text'])
3️⃣ Deploy as an API using FastAPI
You can create a simple API for summarization:
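from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup so every request reuses it
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str  # sent in the JSON body as {"text": "..."}

@app.post("/summarize/")
def summarize(request: SummarizeRequest):
    inputs = tokenizer(f"summarize: {request.text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Requires: pip install fastapi uvicorn
# Run with: uvicorn filename:app --reload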
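An API endpoint is likely to receive full articles, and the tokenizer calls in this README truncate input to 512 tokens, so anything beyond that limit is dropped without warning. One possible workaround, sketched below, is to split a long document at the Khmer sentence terminator (។) and summarize each chunk separately. The helper name `split_khmer_text` and the `max_chars` budget are hypothetical, and a character count is only a rough proxy for the token limit.

```python
import re


def split_khmer_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    Sentences are split after the Khmer terminator '។'. A single sentence
    longer than max_chars is kept whole as its own chunk.
    """
    # Each match is a run of non-terminator characters plus an optional '។'.
    sentences = [s.strip() for s in re.findall(r"[^។]+។?", text) if s.strip()]

    chunks = []
    current = ""
    for sentence in sentences:
        # +1 accounts for the space used to join sentences within a chunk.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = sentence if not current else current + " " + sentence
    if current:
        chunks.append(current)
    return chunks


print(split_khmer_text("ក។ ខ។ គ។", max_chars=5))
```

Each chunk could then be passed to `summarize_khmer` (or the endpoint) and the partial summaries joined. Whether that gives good results on this model is untested here.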
📊 Model Evaluation
The model was evaluated using ROUGE scores, which measure the n-gram and longest-common-subsequence overlap between the generated summaries and the reference summaries.
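# `load_metric` was removed from recent `datasets` releases; use the `evaluate` library instead
# Requires: pip install evaluate rouge_score
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    # Assumes predictions are token IDs (e.g. Seq2SeqTrainer with predict_with_generate=True)
    pred_ids = pred.predictions
    # Ignored label positions are marked with -100; restore the pad token before decoding
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Trainer instance used for fine-tuning, created with compute_metrics=compute_metrics
trainer.evaluate()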
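To make the metric concrete, here is a minimal sketch of ROUGE-1 F1 computed by hand. It is a simplified illustration, not the implementation behind the `rouge` metric used above. The function name `rouge1_f1` is hypothetical, and treating each character as a token is an assumption made because Khmer script does not separate words with spaces. It is also worth checking how the ROUGE implementation you use tokenizes Khmer, since tokenizers designed for English can discard non-Latin characters.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Character-level ROUGE-1 F1 (illustrative only)."""
    # Assumption: every non-whitespace character counts as one token.
    pred_tokens = [ch for ch in prediction if not ch.isspace()]
    ref_tokens = [ch for ch in reference if not ch.isspace()]
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "កម្ពុជាជាប្រទេសនៅអាស៊ីអាគ្នេយ៍។"
print(rouge1_f1(reference, reference))
```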
💾 Saving & Uploading the Model
After fine-tuning, the model and tokenizer were uploaded to the Hugging Face Hub:
model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")
To download it later:
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")
🎯 Summary
| Feature | Details |
|---|---|
| Base Model | google/mt5-small |
| Task | Summarization |
| Language | Khmer (ខ្មែរ) |
| Dataset | kimleang123/khmer-text-dataset |
| Framework | Hugging Face Transformers |
| Evaluation Metric | ROUGE Score |
| Deployment | Hugging Face Model Hub, API (FastAPI), Python Code |
🤝 Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have improvements to suggest.
📬 Contact
If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.
📌 Built for the Khmer NLP Community 🇰🇭 🚀