Multi-Label Emotion Classification with XLM-RoBERTa
This repository fine-tunes the songhieng/khmer-xlmr-base-sentimental-multi-label model for multi-label emotion classification of Khmer text. Training uses the Hugging Face Transformers library together with datasets and evaluate.
Overview
The task involves predicting multiple emotions (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer’s efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation
Dataset
- Total Data Size: 24,969 samples
- Train-Validation Split: 90% training, 10% validation
- Model Accuracy (3 epochs): 72.12%
Requirements
Ensure you have the required dependencies installed:
pip install torch transformers datasets evaluate scikit-learn
Data Format
The expected input data is a CSV file with columns structured as follows:
Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
---|---|---|---|---|---|---|---|---|---|---|---|
... | ... | ... | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
- text_khm: The Khmer text input
- Emotion columns: One column per emotion label (binary values)
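As a small illustration of this layout, the sketch below parses one row with the standard library and builds the eight-element label vector. The sample row, its values, and the helper name `row_to_example` are made up for illustration; only the column names come from the table above.

```python
import csv
import io

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]

# Hypothetical one-row CSV that follows the column layout described above.
SAMPLE_CSV = (
    "Text,emotion,emotion_score,text_khm," + ",".join(EMOTIONS) + "\n"
    "Great news,joy,0.9,ដំណឹងល្អ,0,0,0,0,1,1,0,0\n"
)


def row_to_example(row):
    """Keep the Khmer text and collect the binary emotion columns as floats."""
    return {
        "text_khm": row["text_khm"],
        "labels": [float(row[name]) for name in EMOTIONS],
    }


examples = [row_to_example(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(examples[0]["labels"])
```

Each row yields one text and one multi-hot vector, so a sample can carry several emotions at once (here both joy and optimism).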
Training the Model
Data Preparation:
- Load the dataset.
- Select relevant columns ("text_khm" and emotion labels).
- Split into training (90%) and validation (10%) sets.
- Convert the dataset into Hugging Face Dataset format.
Tokenization:
- Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.
Model Setup:
- Load the pre-trained songhieng/khmer-xlmr-base-sentimental-multi-label model.
- Convert labels to float for BCEWithLogitsLoss.
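The float conversion matters because `torch.nn.BCEWithLogitsLoss` expects targets with the same floating-point dtype as the logits. The sketch below uses a made-up batch and all-zero logits in place of real model outputs. Transformers sequence-classification heads generally select this loss when the config's `problem_type` is `"multi_label_classification"`, though whether this repository sets it explicitly is not stated here.

```python
import math

import torch

# Hypothetical batch: 2 samples, 8 emotion labels stored as integers.
int_labels = torch.tensor([[0, 0, 0, 0, 1, 1, 0, 0],
                           [1, 0, 1, 0, 0, 0, 0, 0]])

# Stand-in for model outputs; real logits would come from the classifier head.
logits = torch.zeros(2, 8)

loss_fn = torch.nn.BCEWithLogitsLoss()

# Integer targets would be rejected by the loss, so cast them to float first.
float_labels = int_labels.float()
loss = loss_fn(logits, float_labels)

print(loss.item())
```

With all-zero logits every label gets probability 0.5, so the loss equals ln 2 (about 0.693) regardless of the targets, which makes a handy sanity check for an untrained head.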
Custom Data Collator:
- Use the built-in DataCollatorWithPadding for efficient batching.
Training and Evaluation:
- Define training arguments (learning rate, batch sizes, number of epochs, etc.).
- Implement a custom compute_metrics function for multi-label subset accuracy.
- Train the model using the Trainer class from Hugging Face.
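A subset accuracy metric of the kind described above might be written as follows. This is a minimal sketch, not the repository's exact function: the 0.5 threshold, the returned key name `subset_accuracy`, and the toy arrays are assumptions. A prediction counts as correct only if all labels for that sample match.

```python
import numpy as np


def compute_metrics(eval_pred, threshold=0.5):
    """Subset accuracy: a sample is correct only if every label matches."""
    logits, labels = eval_pred
    # Sigmoid turns each logit into an independent per-label probability.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    preds = (probs >= threshold).astype(int)
    labels = np.asarray(labels).astype(int)
    exact_match = np.all(preds == labels, axis=1)
    return {"subset_accuracy": float(exact_match.mean())}


# Toy check with 3 samples and 3 labels (values are illustrative only).
logits = np.array([[3.0, -2.0, -1.0],
                   [-1.0, 2.0, 0.5],
                   [-3.0, -3.0, 4.0]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)

metrics = compute_metrics((logits, labels))
print(metrics)
```

In the toy data the second sample has one extra predicted label, so two of three samples match exactly. The function can be passed to `Trainer` through its `compute_metrics` argument, since the evaluation prediction object unpacks into predictions and label ids.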
Testing the Model
To test the trained model on new Khmer text, use the following script:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for emotion classification
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Sigmoid (not softmax): each emotion is scored independently
probabilities = torch.sigmoid(logits).squeeze().numpy()

# Define emotion labels
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Create a dictionary mapping emotions to probabilities
emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)
Example Output
{
"Anger": 0.016747648,
"Anticipation": 0.051519673,
"Disgust": 0.01696622,
"Fear": 0.0047147004,
"Joy": 0.82434595,
"Optimism": 0.052789055,
"Sadness": 0.026356682,
"Surprise": 0.0024202482
}
This output indicates that the model assigned a high probability to 'Joy' (about 0.82). Every other emotion, including 'Optimism' (about 0.05), falls well below the default 0.5 threshold.
Customization
- Thresholding: You can adjust the probability threshold (default: 0.5) to determine which emotions are considered present.
- Fine-tuning Parameters: Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
- Alternative Models: You can swap songhieng/khmer-xlmr-base-sentimental-multi-label for another Khmer-language model.
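To make the thresholding option concrete, the sketch below applies two cut-offs to the probabilities from the example output. The helper name `select_emotions` and the 0.05 value are illustrative choices, not part of the repository.

```python
# Probabilities copied from the example output above.
emotion_probabilities = {
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482,
}


def select_emotions(probabilities, threshold=0.5):
    """Return the labels whose probability reaches the threshold."""
    return [label for label, p in probabilities.items() if p >= threshold]


print(select_emotions(emotion_probabilities))        # ['Joy']
print(select_emotions(emotion_probabilities, 0.05))  # ['Anticipation', 'Joy', 'Optimism']
```

Lowering the threshold trades precision for recall: more emotions are reported, at the cost of including weakly supported ones.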
Troubleshooting
KeyError: 'text'
This error occurs when the text field is missing after tokenization. Ensure that tokenization is correctly applied before passing the dataset to the trainer.
ValueError in Metric Computation
Since multi-label targets differ from single-label classification, use a custom compute_metrics function to calculate subset accuracy.
License
This project is provided for educational and research purposes. Please refer to the license file for details.
This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.
Base model: FacebookAI/xlm-roberta-base