Khmer mT5 Summarization Model

📌 Introduction

This repository contains a fine-tuned mT5 model for Khmer text summarization. The model is based on Google's mT5-small and fine-tuned on a dataset of Khmer text and corresponding summaries.

Fine-tuning was performed using the Hugging Face Trainer API, optimizing the model to generate concise and meaningful summaries of Khmer text.


🚀 Model Details

  • Base Model: google/mt5-small
  • Fine-tuned for: Khmer text summarization
  • Training Dataset: kimleang123/khmer-text-dataset
  • Framework: Hugging Face transformers
  • Task Type: Sequence-to-Sequence (Seq2Seq)
  • Input: Khmer text (articles, paragraphs, or documents)
  • Output: Summarized Khmer text
  • Training Hardware: GPU (Tesla T4)
  • Evaluation Metric: ROUGE Score

🔧 Installation & Setup

1️⃣ Install Dependencies

Install transformers, torch, and datasets. mT5 uses a SentencePiece tokenizer, so sentencepiece may also be required:

pip install transformers torch datasets sentencepiece

2️⃣ Load the Model

To load and use the fine-tuned model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

📌 How to Use

1️⃣ Using Python Code

def summarize_khmer(text, max_length=150):
    # Prepend the "summarize: " task prefix used throughout this model card
    input_text = f"summarize: {text}"
    # Inputs longer than 512 tokens are truncated
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    # Beam search with 5 beams; a length_penalty above 1 favors longer outputs
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# English: "Cambodia has a population of about 16 million, and it is a country in Southeast Asia."
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
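Because the tokenizer call truncates input at 512 tokens, anything beyond that point in a long article never reaches the model. The following is a minimal sketch of one workaround, not part of the original model card: split the text at the Khmer full stop (។), pack sentences into chunks, and summarize each chunk separately. The helper names and the `max_chars` budget of 1000 are assumptions made for illustration, and a character budget is only a rough stand-in for the real token limit.

```python
from typing import Callable, List

KHMER_FULL_STOP = "។"  # KHMER SIGN KHAN (U+17D4), the sentence-ending mark


def split_khmer_sentences(text: str) -> List[str]:
    """Split text at each Khmer full stop, keeping the mark with its sentence."""
    parts = text.split(KHMER_FULL_STOP)
    sentences = []
    for i, part in enumerate(parts):
        part = part.strip()
        if not part:
            continue
        # Every piece except the last was followed by a full stop in the input.
        is_last = i == len(parts) - 1
        sentences.append(part if is_last else part + KHMER_FULL_STOP)
    return sentences


def chunk_sentences(sentences: List[str], max_chars: int) -> List[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_long_text(text: str, summarize: Callable[[str], str], max_chars: int = 1000) -> str:
    """Summarize each chunk with the given function and join the partial summaries."""
    chunks = chunk_sentences(split_khmer_sentences(text), max_chars)
    return " ".join(summarize(chunk) for chunk in chunks)
```

With the model loaded, this could be called as `summarize_long_text(long_article, summarize_khmer)`. Joining per-chunk summaries is a simple choice that can produce repetitive output; whether it suits this model has not been evaluated here.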

2️⃣ Using Hugging Face Pipeline

For a simpler approach:

from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
# English: "Cambodia has a population of about 16 million, and it is a country in Southeast Asia."
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
# Unlike summarize_khmer, this call passes the text without the "summarize: " prefix
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
# The pipeline returns a list with one dict per input text
print("🔹 Khmer Summary:", summary[0]['summary_text'])

3️⃣ Deploy as an API using FastAPI

You can create a simple API for summarization:

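from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup rather than on every request
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str  # Khmer text, sent in the JSON request body instead of the query string

@app.post("/summarize/")
def summarize(request: SummarizeRequest):
    inputs = tokenizer(f"summarize: {request.text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload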

📊 Model Evaluation

The model was evaluated using ROUGE, which measures the n-gram overlap between the generated summaries and the reference summaries.

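import numpy as np
import evaluate  # replaces datasets.load_metric, which has been removed from datasets

rouge = evaluate.load("rouge")  # requires: pip install evaluate rouge_score

def compute_metrics(pred):
    # Assumes predictions are generated token IDs (predict_with_generate=True)
    # -100 marks ignored positions; swap in the pad token so decoding does not fail
    pred_ids = np.where(pred.predictions != -100, pred.predictions, tokenizer.pad_token_id)
    labels_ids = np.where(pred.label_ids != -100, pred.label_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# trainer is the Trainer instance used for fine-tuning, created with compute_metrics=compute_metrics
trainer.evaluate()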
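One caveat when scoring Khmer with ROUGE: the default tokenizer in the `rouge_score` package keeps only lowercase ASCII letters and digits, so Khmer script is discarded before any overlap is counted. The `rouge` metric in `evaluate` accepts a custom `tokenizer` callable for this reason. The sketch below is an illustration only, not the evaluation used for this model: it computes ROUGE-1 F1 in plain Python with a character-level tokenizer, which is an assumption chosen because Khmer does not separate words with spaces. A proper Khmer word segmenter would be a more faithful choice.

```python
from collections import Counter
from typing import Callable, List


def char_tokens(text: str) -> List[str]:
    """Treat every non-whitespace character as one token (illustrative assumption)."""
    return [ch for ch in text if not ch.isspace()]


def rouge1_f1(prediction: str, reference: str,
              tokenize: Callable[[str], List[str]] = char_tokens) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Character-level scores count Khmer vowel signs and subscript marks as separate tokens, so they tend to run higher than word-level scores and are not comparable with them.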

💾 Saving & Uploading the Model

After fine-tuning, the model was uploaded to the Hugging Face Hub:

model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")

To download it later:

model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")

🎯 Summary

| Feature | Details |
| --- | --- |
| Base Model | google/mt5-small |
| Task | Summarization |
| Language | Khmer (ខ្មែរ) |
| Dataset | kimleang123/khmer-text-dataset |
| Framework | Hugging Face Transformers |
| Evaluation Metric | ROUGE Score |
| Deployment | Hugging Face Model Hub, API (FastAPI), Python Code |

🀝 Contributing

Contributions are welcome! Feel free to open issues or submit pull requests if you find any improvements.

📬 Contact

If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.

📌 Built for the Khmer NLP Community 🇰🇭 🚀
