Khmer mT5 Summarization Model (1024 Tokens)
Introduction
This repository contains a fine-tuned mT5 model for Khmer text summarization, extending the original khmer-mt5-summarization model. The primary enhancement in this version is support for longer inputs: training was adjusted to accommodate texts of up to 1024 tokens.
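Since inputs are capped at 1024 tokens, it can be useful to check how many tokens a document occupies before summarizing it. A small self-contained sketch (the "..." string is a placeholder for real Khmer text):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")

# Count tokens without truncation; anything beyond 1024 tokens would be cut off
token_count = len(tokenizer("summarize: " + "...", truncation=False).input_ids)
print(f"{token_count} tokens (model limit: 1024)")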
Model Details
- Base Model: google/mt5-small
- Fine-tuned for: Khmer text summarization with extended input length
- Training Dataset: kimleang123/khmer-text-dataset
- Framework: Hugging Face transformers
- Task Type: Sequence-to-Sequence (Seq2Seq)
- Input: Khmer text (articles, paragraphs, or documents) up to 1024 tokens
- Output: Summarized Khmer text
- Training Hardware: GPU (Tesla T4)
- Evaluation Metric: ROUGE Score
Installation & Setup
1️⃣ Install Dependencies
Ensure you have transformers, torch, and datasets installed:
pip install transformers torch datasets
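A quick optional check that the libraries import correctly:

# Optional: confirm the installation by printing the library versions
import transformers, torch, datasets
print(transformers.__version__, torch.__version__, datasets.__version__)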
2️⃣ Load the Model
To load and use the fine-tuned model:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "songhieng/khmer-mt5-summarization-1024tk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
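By default the model runs on CPU, which is what the examples below assume. If a GPU is available (the model was fine-tuned on a Tesla T4), you can optionally move it over; note that any tokenized inputs must then live on the same device:

import torch

# Optional: run inference on GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
# When generating, also move the inputs: tokenizer(text, return_tensors="pt").to(device)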
How to Use
1️⃣ Using Python Code
def summarize_khmer(text, max_length=150):
    # Prepend the "summarize:" task prefix used throughout this card
    input_text = f"summarize: {text}"
    # Truncate to the 1024-token limit the model was trained with
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"  # "Cambodia has a population of about 16 million and is a country in Southeast Asia."
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
2️⃣ Using Hugging Face Pipeline
For a simpler approach:
from transformers import pipeline
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
3️⃣ Deploy as an API using FastAPI
You can create a simple API for summarization:
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

app = FastAPI()

# Load the fine-tuned model and tokenizer once at startup
model_name = "songhieng/khmer-mt5-summarization-1024tk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

@app.post("/summarize/")
def summarize(text: str):
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
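Once the server is running, the endpoint can be called like this (FastAPI treats the plain str argument text as a query parameter; the URL assumes uvicorn's default host and port):

import requests

# Hypothetical client call; replace "..." with real Khmer text
response = requests.post("http://127.0.0.1:8000/summarize/", params={"text": "..."})
print(response.json()["summary"])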
Model Evaluation
The model was evaluated using ROUGE scores, which measure the similarity between the generated summaries and the reference summaries.
from datasets import load_metric

rouge = load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # Replace the -100 padding used for loss masking before decoding
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

trainer.evaluate()
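Note that datasets.load_metric is deprecated in recent datasets releases (and removed in 3.x); the drop-in replacement is the evaluate library, which exposes the same compute interface:

# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
# rouge.compute(predictions=..., references=...) returns the ROUGE scores as before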
Saving & Uploading the Model
After fine-tuning, the model can be uploaded to the Hugging Face Hub:
model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
To download it later:
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
Summary
Feature | Details |
---|---|
Base Model | google/mt5-small |
Task | Summarization |
Language | Khmer (ខ្មែរ) |
Dataset | kimleang123/khmer-text-dataset |
Framework | Hugging Face Transformers |
Evaluation Metric | ROUGE Score |
Deployment | Hugging Face Model Hub, API (FastAPI), Python Code |
Contributing
Contributions are welcome! Feel free to open issues or submit pull requests if you have any improvements or suggestions.
Contact
If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.
Built for the Khmer NLP Community