Finance Document Classifier

This repository contains a classifier for determining whether a document is finance-related.

Model Overview

  • A regression-based classifier with two classes: financial (1) and non-financial (0).
  • Uses Snowflake/snowflake-arctic-embed-m as the embedding model, with a classification head on top. The model is trained with a regression objective.
  • We used Qwen/Qwen2.5-72B-Instruct to annotate 110k CulturaX documents with a score between 0 and 5. For training, scores in [0, 2] are converted to 0 and scores in [3, 5] to 1. We then trained on 108k documents and tested on 2k.
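
The score-to-label conversion described above can be sketched as follows. This is a minimal illustration rather than the training code itself: the helper name `binarize_score` is hypothetical, and the sketch assumes the annotation scores are integers from 0 to 5.

```python
def binarize_score(score: int) -> int:
    """Map a 0-5 annotation score to a binary label (hypothetical helper)."""
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    # Scores in [0, 2] become non-financial (0); scores in [3, 5] become financial (1).
    return 1 if score >= 3 else 0


# Apply the mapping to every possible score.
labels = [binarize_score(s) for s in range(6)]
print(labels)
```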

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LinguaCustodia/finance_classifier")
model = AutoModelForSequenceClassification.from_pretrained("LinguaCustodia/finance_classifier")

# Example text
text = "This is a test sentence."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

# Get model outputs
outputs = model(**inputs)
logits = outputs.logits.float().detach().cpu().numpy()
logits = logits.ravel().tolist()

# Convert the regression outputs to class labels: clamp each value to [0, 1], then round
int_scores = [int(round(max(0, min(logit, 1)))) for logit in logits]  # 0 for non-financial, 1 for financial
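
The final conversion step can be tried on its own, without loading the model. The sketch below wraps the same clamp-and-round logic in a function; the name `logits_to_labels` and the input values are made up for illustration and are not real model outputs.

```python
from typing import List


def logits_to_labels(logits: List[float]) -> List[int]:
    """Clamp each regression output to [0, 1], then round to 0 or 1 (hypothetical helper)."""
    return [int(round(max(0.0, min(float(x), 1.0)))) for x in logits]


# Made-up values standing in for model outputs, one per input text.
example_logits = [-0.3, 0.2, 0.7, 1.4]
print(logits_to_labels(example_logits))
```

Note that Python's `round` rounds exact halves to the nearest even integer, so an output of exactly 0.5 maps to 0 (non-financial).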

Model Performance

  • Evaluated on the test set of 2000 samples.
                precision    recall  f1-score   support

           0       0.95      0.99      0.97      1750
           1       0.92      0.62      0.74       250
    accuracy                           0.95      2000
   macro avg       0.93      0.81      0.85      2000
weighted avg       0.94      0.95      0.94      2000
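
As a reading aid for the report above, the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded per-class values in the table. Because those inputs are already rounded to two decimals, recomputed figures (the averages in particular) may differ from the report in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs copied from the report above.
per_class = {0: (0.95, 0.99), 1: (0.92, 0.62)}
f1 = {label: f1_score(p, r) for label, (p, r) in per_class.items()}

for label, value in f1.items():
    print(f"class {label}: f1 = {value:.2f}")
```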

Citation

If you use this model in your research or applications, please cite this repository.

@misc{ClassiFin,
  title={ClassiFin: Finance Document Classifier},
  author={Liu, Jingshu and Qader, Raheel and Caillaut, Gaëtan and Nakhlem, Mariam and Barthelemy, Jean-Gabriel and Sadoune, Arezki and Foly, Sabine},
  url={https://huggingface.co/LinguaCustodia/ClassiFin},
  year={2025}
}