Khmer Financial Sentiment Analysis with XLM-RoBERTa

This repository contains a fine-tuned version of the XLM-RoBERTa-base model, trained for Khmer-language sentiment analysis in the financial domain. The model was fine-tuned on approximately 4,000 financial text samples and evaluated on a test set of 400 samples.

Overview
Model Details
Training Data
Training Details
Results
Usage

Overview

Financial texts such as reports, news, and earnings statements contain valuable information for market analysis. However, Khmer-language financial texts have received little attention in NLP research. This project adapts the XLM-RoBERTa-base model for Khmer sentiment analysis in the financial domain.

This model is trained to classify financial text sentiment as either:

Positive (1): Indicates growth, profitability, or a positive outlook.
Negative (0): Indicates loss, risk, or financial downturns.

Model Details

Base Model: XLM-RoBERTa-base
Task: Sentiment Analysis (Binary Classification: Positive / Negative)
Domain: Financial Data (Khmer Language)
Dataset Size: ~4,000 training samples, 400 test samples
Architecture: Transformer-based sequence classification model

Training Data

The model was fine-tuned using a dataset of Khmer-language financial texts, including:

Bank reports
Financial news articles
Economic forecasts
Investment analysis

The dataset consists of approximately 4,000 labeled examples for training and 400 samples for testing.

Training Details

The model was fine-tuned over 3 epochs, using XLM-RoBERTa-base as the pretrained model.

Epoch	Training Loss	Validation Loss	Accuracy
1	0.163500	0.511470	XX%
2	0.517700	0.581499	XX%
3	0.312900	0.526096	XX%
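As an aid to reading the loss columns, the sketch below computes cross-entropy for a single two-class example in plain Python. It assumes the reported losses are mean cross-entropy values; the logits are hypothetical and not taken from the model. A useful reference point is that a two-class model with no preference between classes scores ln 2, about 0.693.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits in [Negative, Positive] order, for illustration only.
uninformed = cross_entropy([0.0, 0.0], 1)        # no preference: ln(2)
confident_right = cross_entropy([-2.0, 2.0], 1)  # small loss
confident_wrong = cross_entropy([-2.0, 2.0], 0)  # large loss

print(uninformed, confident_right, confident_wrong)
```

Losses below ln 2 indicate that, on average, the model puts more than half of its probability mass on the correct label.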

Training Configuration:

Learning Rate: 2e-5
Batch Size: 8
Optimizer: AdamW
Evaluation Strategy: Per epoch
Loss Function: CrossEntropyLoss
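
As a rough sanity check on this configuration, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes exactly 4,000 training samples, a single device, and no gradient accumulation; none of these are confirmed here, and the function name is hypothetical.

```python
import math
from typing import Tuple

def optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps), counting a final partial batch."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Assumed values: ~4,000 samples, batch size 8, 3 epochs.
steps_per_epoch, total_steps = optimizer_steps(4000, 8, 3)
print(steps_per_epoch, total_steps)  # 500 1500
```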

Results

Accuracy: ~96% on the validation set.
Strong Performance: The model classifies Khmer financial sentiment as Positive or Negative with the accuracy reported above.
Domain-Specific Optimization: Fine-tuning on in-domain text is intended to improve the model's handling of Khmer financial terminology.
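
For reference, accuracy is the fraction of predictions that match the gold labels. The following is a minimal sketch on toy labels, not on the real validation data; the `accuracy` helper is illustrative and is not part of this repository.

```python
def accuracy(predictions, labels):
    """Fraction of predictions equal to the corresponding gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy data (0 = Negative, 1 = Positive), not drawn from the real dataset.
toy_labels = [1, 0, 1, 1, 0]
toy_predictions = [1, 0, 0, 1, 0]
print(accuracy(toy_predictions, toy_labels))  # 0.8
```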

Usage

Below is an example of how to use the fine-tuned model for sentiment prediction:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned Khmer financial sentiment model
model_name = "songhieng/khmer-sentiment-xlm-roberta-base"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer financial text
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run inference in evaluation mode without tracking gradients
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# Get predicted sentiment (0 = Negative, 1 = Positive)
predicted_class = outputs.logits.argmax(dim=1).item()
labels_mapping = {0: "Negative", 1: "Positive"}
print(f"Predicted Sentiment: {labels_mapping[predicted_class]}")
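
Taking the argmax discards how confident the model is. A softmax over the two logits turns them into class probabilities. The sketch below uses plain Python on hypothetical logit values so that it runs without downloading the model; with the real model, `torch.softmax(outputs.logits, dim=1)` should give the equivalent result. The helper names are illustrative.

```python
import math

LABELS = {0: "Negative", 1: "Positive"}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits):
    """Return the predicted label and the probability assigned to it."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]

# Hypothetical logits in [Negative, Positive] order, for illustration only.
label, confidence = predict_with_confidence([-1.2, 2.3])
print(f"Predicted Sentiment: {label} (confidence {confidence:.2f})")
```

A low confidence value can be used to flag borderline texts for manual review, though softmax scores from a fine-tuned classifier are not guaranteed to be well calibrated.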
