Khmer Financial Sentiment Analysis with XLM-RoBERTa

This repository contains a version of the XLM-RoBERTa-base model fine-tuned for sentiment analysis of Khmer-language text in the financial domain. The model was fine-tuned on a dataset of approximately 4,000 financial text samples, with a separate test set of 400.

Overview

Financial texts such as reports, news, and earnings statements contain valuable information for market analysis, yet Khmer-language financial texts have received little attention in NLP research. This project adapts the XLM-RoBERTa-base model to sentiment analysis of Khmer financial text.

The model classifies the sentiment of a financial text as one of two classes:

  • Positive (1): Indicates growth, profitability, or a positive outlook.
  • Negative (0): Indicates loss, risk, or financial downturns.

Model Details

  • Base Model: XLM-RoBERTa-base
  • Task: Sentiment Analysis (Binary Classification: Positive / Negative)
  • Domain: Financial Data (Khmer Language)
  • Dataset Size: ~4,000 training samples, 400 test samples
  • Architecture: Transformer-based sequence classification model

Training Data

The model was fine-tuned using a dataset of Khmer-language financial texts, including:

  • Bank reports
  • Financial news articles
  • Economic forecasts
  • Investment analysis

The dataset consists of approximately 4,000 labeled examples for training and 400 samples for testing.

Training Details

The model was fine-tuned over 3 epochs, using XLM-RoBERTa-base as the pretrained model.

Epoch   Training Loss   Validation Loss   Accuracy
1       0.163500        0.511470          XX%
2       0.517700        0.581499          XX%
3       0.312900        0.526096          XX%

Training Configuration:

  • Learning Rate: 2e-5
  • Batch Size: 8
  • Optimizer: AdamW
  • Evaluation Strategy: Per epoch
  • Loss Function: CrossEntropyLoss
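
For a sense of scale, the configuration above implies a fixed number of optimizer updates. The sketch below is not the original training script. It assumes exactly 4,000 training samples (the card says approximately 4,000), no gradient accumulation, and a single device. The `FineTuneConfig` class and its field names are hypothetical and serve only to hold the listed values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters listed in this card; the class itself is illustrative."""
    learning_rate: float = 2e-5
    batch_size: int = 8
    num_epochs: int = 3
    train_samples: int = 4000  # assumption: the card says "~4,000"

    def steps_per_epoch(self) -> int:
        # The last batch of an epoch may be partial, so round up.
        return math.ceil(self.train_samples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.num_epochs


config = FineTuneConfig()
print(config.steps_per_epoch())  # 500
print(config.total_steps())      # 1500
```

Under these assumptions, each per-epoch evaluation would follow 500 updates.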

Results

  • Accuracy: ~96% on the validation set.
  • Strong Performance: The model effectively classifies Khmer financial sentiment.
  • Domain-Specific Optimization: Fine-tuning on in-domain text adapts the model to financial terminology in Khmer.
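
Accuracy here means the fraction of samples whose predicted label matches the gold label. A minimal sketch of that computation follows. The function name `accuracy` and the toy labels are illustrative and are not taken from the evaluation code. If the ~96% figure refers to a 400-sample evaluation split (an assumption), it would correspond to about 384 correct predictions.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example with made-up labels: 0 = Negative, 1 = Positive
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 0]
print(accuracy(pred, gold))  # 0.8
```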

Usage

Below is an example of how to use the fine-tuned model for sentiment prediction:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned Khmer financial sentiment model
model_name = "songhieng/khmer-sentiment-xlm-roberta-base"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer financial text
# (roughly: "The company's announced revenue has increased significantly")
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run the forward pass without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Get predicted sentiment (0 = Negative, 1 = Positive)
predicted_class = outputs.logits.argmax(dim=1).item()
labels_mapping = {0: "Negative", 1: "Positive"}
print(f"Predicted Sentiment: {labels_mapping[predicted_class]}")
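
The argmax above yields a label but no confidence. One common option is to pass the two logits through a softmax. The sketch below does this in plain Python so that it runs without downloading the model. The helper `logits_to_prediction` and the example logits are hypothetical. With the real model, the logits for a single input could be obtained with something like `outputs.logits[0].tolist()`.

```python
import math
from typing import Sequence, Tuple

LABELS = {0: "Negative", 1: "Positive"}  # same mapping as in the usage example


def logits_to_prediction(logits: Sequence[float]) -> Tuple[str, float]:
    """Return (label, softmax probability of that label) for one sample."""
    # Subtract the max logit before exponentiating for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability; ties resolve to the lowest index.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits for illustration; real values come from the model.
label, confidence = logits_to_prediction([-1.2, 2.3])
print(label, round(confidence, 3))
```

A softmax score is not a calibrated probability, so any threshold on it would need to be checked against held-out data.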