Finance Document Classifier
This repository contains a classifier for determining whether a document is finance-related.
Model Overview
- A regression-based classifier with two classes: financial (1) and non-financial (0).
- Uses Snowflake/snowflake-arctic-embed-m as the embedding model, with a classification head on top. The model is trained as a regressor.
- Qwen/Qwen2.5-72B-Instruct was used to annotate 110k CulturaX documents with a score between 0 and 5. For training, scores in [0, 2] are mapped to 0 and scores in [3, 5] to 1. The model was trained on 108k documents and tested on the remaining 2k.
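The score-to-label conversion described above can be sketched as follows. This is a minimal illustration rather than the authors' training code: the helper name `binarize_score` and the sample scores are hypothetical.

```python
def binarize_score(score: int) -> int:
    """Map a 0-5 annotation score to a binary label.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    # Scores 0-2 become 0 (non-financial); scores 3-5 become 1 (financial).
    return 1 if score >= 3 else 0


# Made-up annotation scores, one per document.
scores = [0, 1, 2, 3, 4, 5]
labels = [binarize_score(s) for s in scores]
print(labels)  # [0, 0, 0, 1, 1, 1]
```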
How to Use
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LinguaCustodia/finance_classifier")
model = AutoModelForSequenceClassification.from_pretrained("LinguaCustodia/finance_classifier")

# Example text
text = "This is a test sentence."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

# Get model outputs (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Flatten the outputs to a plain list of scores
logits = outputs.logits.float().cpu().numpy().ravel().tolist()

# Convert scores to class labels: clamp to [0, 1], then round
int_scores = [int(round(max(0, min(logit, 1)))) for logit in logits]  # 0 for non-financial, 1 for financial
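Because the model is trained as a regressor, the final step above amounts to clamping each output to [0, 1] and rounding. The sketch below expresses the same mapping with an explicit decision threshold. The `to_label` helper and its `threshold` parameter are assumptions made for illustration and are not part of the model's API. With the default of 0.5 it gives the same labels as the clamp-and-round expression, since Python rounds 0.5 down to 0.

```python
def to_label(score: float, threshold: float = 0.5) -> int:
    """Return 1 (financial) if the score exceeds the threshold, else 0.

    Hypothetical helper for illustration; not part of the model's API.
    """
    return 1 if score > threshold else 0


# Made-up regression outputs, including values outside [0, 1].
example_scores = [-0.2, 0.3, 0.5, 0.7, 1.4]

print([to_label(s) for s in example_scores])                  # [0, 0, 0, 1, 1]
print([to_label(s, threshold=0.25) for s in example_scores])  # [0, 1, 1, 1, 1]
```

A lower threshold would be expected to label more documents as financial, trading precision for recall. The effect on this model has not been measured here.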
Model Performance
- Evaluated on the test set of 2000 samples.
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1750
           1       0.92      0.62      0.74       250

    accuracy                           0.95      2000
   macro avg       0.93      0.81      0.85      2000
weighted avg       0.94      0.95      0.94      2000
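As a quick check on the report above, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded per-class values, so agreement is only expected to two decimal places. The function name `f1` is chosen for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the report for class 0 and class 1.
print(round(f1(0.95, 0.99), 2))  # 0.97
print(round(f1(0.92, 0.62), 2))  # 0.74
```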
Citation
If you use this model in your research or applications, please cite this repository.
@misc{ClassiFin,
title={ClassiFin: Finance Document Classifier},
author={Liu, Jingshu and Qader, Raheel and Caillaut, Gaëtan and Nakhlem, Mariam and Barthelemy, Jean-Gabriel and Sadoune, Arezki and Foly, Sabine},
url={https://huggingface.co/LinguaCustodia/ClassiFin},
year={2025}
}