Multi-Label Emotion Classification with XLM-RoBERTa

This repository provides an implementation of fine-tuning the songhieng/khmer-xlmr-base-sentimental-multi-label model for multi-label emotion classification of Khmer text. The model is trained with the Hugging Face Transformers library, together with the datasets and evaluate libraries.

Overview

The task involves predicting multiple emotions (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:

  • Data preparation and splitting (90% training, 10% validation)
  • Tokenization using the fast XLM-RoBERTa tokenizer
  • A custom data collator that leverages the tokenizer’s efficient padding method
  • Fine-tuning an XLM-RoBERTa model for multi-label classification
  • Computing a custom multi-label subset accuracy metric during evaluation

Dataset

  • Total Data Size: 24,969 samples
  • Train-Test Split: 90% training, 10% validation
  • Model Accuracy (subset accuracy, 3 epochs): 72.12%

Requirements

Ensure you have the required dependencies installed:

pip install torch transformers datasets evaluate scikit-learn

Data Format

The expected input data is a CSV file with columns structured as follows:

Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise
...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1
  • text_khm: The Khmer text input
  • Emotion columns: One column per emotion label (binary values)

Training the Model

  1. Data Preparation:

    • Load the dataset.
    • Select relevant columns ("text_khm" and emotion labels).
    • Split into training (90%) and validation (10%) sets.
    • Convert the dataset into a Hugging Face Dataset format.
  2. Tokenization:

    • Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.
  3. Model Setup:

    • Load the pre-trained songhieng/khmer-xlmr-base-sentimental-multi-label model.
    • Convert labels to float for BCEWithLogitsLoss.
  4. Custom Data Collator:

    • Use the built-in DataCollatorWithPadding for efficient batching.
  5. Training and Evaluation:

    • Define training arguments (learning rate, batch sizes, number of epochs, etc.).
    • Implement a custom compute metrics function for multi-label subset accuracy.
    • Train the model using the Trainer class from Hugging Face (an end-to-end sketch of these steps follows below).
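
The sketch below ties these five steps together in one script, assuming the CSV layout described under Data Format (a text_khm column plus one binary column per emotion). The file name, output directory, and hyperparameters are illustrative placeholders, not the exact settings used to produce the published checkpoint.

import pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

MODEL_NAME = "songhieng/khmer-xlmr-base-sentimental-multi-label"
LABELS = ["anger", "anticipation", "disgust", "fear",
          "joy", "optimism", "sadness", "surprise"]

# 1. Data preparation: keep the Khmer text and binary emotion columns, then split 90/10
df = pd.read_csv("khmer_emotions.csv")[["text_khm"] + LABELS]  # placeholder file name
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

# 2. Tokenization with the fast XLM-RoBERTa tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def preprocess(batch):
    encoding = tokenizer(batch["text_khm"], truncation=True)
    # Pack the binary emotion columns into float vectors for BCEWithLogitsLoss
    encoding["labels"] = [
        [float(batch[label][i]) for label in LABELS]
        for i in range(len(batch["text_khm"]))
    ]
    return encoding

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

# 3. Model setup: the multi-label problem type makes the Trainer use BCEWithLogitsLoss
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# 4. Data collator: pads each batch dynamically through the tokenizer
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 5. Training and evaluation (hyperparameters here are placeholders)
training_args = TrainingArguments(
    output_dir="khmer-emotion-xlmr",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,  # subset-accuracy function, see Troubleshooting
)

trainer.train()
trainer.evaluate()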

Testing the Model

To test the trained model on new Khmer text, use the following script:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for sentiment analysis
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits).squeeze().numpy()

# Define emotion labels (order must match the labels used during fine-tuning)
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Create a dictionary mapping emotions to probabilities
emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)

Example Output

{
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482
}

This output indicates that the model assigns a high probability to 'Joy' for the input text, while the remaining emotions fall well below the default 0.5 threshold.

Customization

  • Thresholding: You can adjust the probability threshold (default: 0.5) to determine which emotions are considered present; see the snippet after this list.
  • Fine-tuning Parameters: Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
  • Alternative Models: You can swap songhieng/khmer-xlmr-base-sentimental-multi-label for another Khmer-language model.
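
As a small illustration of the thresholding point above, the snippet below reuses the emotion_probabilities dictionary from the testing script; the 0.5 value is simply the default mentioned in the text.

THRESHOLD = 0.5  # raise for higher precision, lower for higher recall

# Keep only the emotions whose probability clears the threshold
predicted_emotions = [
    label for label, prob in emotion_probabilities.items() if prob >= THRESHOLD
]
print(predicted_emotions)  # e.g. ['Joy'] for the example output above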

Troubleshooting

KeyError: 'text'

This error occurs when the tokenization step looks up a text column (here 'text') that is not present in the dataset. Make sure the tokenization function reads the column that actually exists (text_khm in this data format) and that tokenization is applied before the dataset is passed to the Trainer.
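
A quick check, assuming the tokenized dataset from the training sketch above:

print(tokenized["train"].column_names)
# Expect something like ['input_ids', 'attention_mask', 'labels'].
# If a raw text column is still listed, pass remove_columns=... to .map()
# so that only tokenized fields reach the data collator.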

ValueError in Metric Computation

Since multi-label targets differ from single-label classification, use a custom compute_metrics function to calculate subset accuracy.
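
Below is a minimal compute_metrics sketch for subset accuracy (a sample counts as correct only if all eight labels match), assuming sigmoid probabilities thresholded at 0.5. Pass it to the Trainer via compute_metrics=compute_metrics; scikit-learn's accuracy_score gives the same result for binary indicator arrays.

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to independent per-label probabilities, then threshold at 0.5
    probs = 1.0 / (1.0 + np.exp(-logits))
    preds = (probs >= 0.5).astype(int)
    # Subset accuracy: every label must match for a sample to count as correct
    exact_match = (preds == labels.astype(int)).all(axis=1).mean()
    return {"subset_accuracy": float(exact_match)}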

License

This project is provided for educational and research purposes. Please refer to the license file for details.


This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.
