---
library_name: transformers
language:
- en
- fr
- it
- es
- ru
- uk
- tt
- ar
- hi
- ja
- zh
- he
- am
- de
license: openrail++
datasets:
- textdetox/multilingual_toxicity_dataset
metrics:
- f1
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: text-classification
---

## Multilingual Toxicity Classifier for 15 Languages (2025)

This is an instance of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) that was fine-tuned on a binary toxicity classification task using our updated (2025) dataset [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset).

The model now covers 15 languages from various language families:

| Language  | Code | F1 Score |
|-----------|------|----------|
| English   | en   | 0.9225   |
| Russian   | ru   | 0.9525   |
| Ukrainian | uk   | 0.96     |
| German    | de   | 0.7325   |
| Spanish   | es   | 0.7125   |
| Arabic    | ar   | 0.6625   |
| Amharic   | am   | 0.5575   |
| Hindi     | hi   | 0.9725   |
| Chinese   | zh   | 0.9175   |
| Italian   | it   | 0.5864   |
| French    | fr   | 0.9235   |
| Hinglish  | hin  | 0.61     |
| Hebrew    | he   | 0.8775   |
| Japanese  | ja   | 0.8773   |
| Tatar     | tt   | 0.5744   |

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('textdetox/xlmr-large-toxicity-classifier-v2')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/xlmr-large-toxicity-classifier-v2')

# Encode a single sentence as a tensor of input ids
batch = tokenizer.encode("You are amazing!", return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    output = model(batch)
# output.logits has one score per class: idx 0 for neutral, idx 1 for toxic
```

## Citation

The model was prepared for the [TextDetox 2025 Shared Task](https://pan.webis.de/clef25/pan25-web/text-detoxification.html) evaluation.

Citation TBD soon.
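
## Interpreting the output

The usage snippet above stops at the raw model output, so here is a minimal sketch of one way to turn the two class scores into a label and a toxicity probability. It relies only on the index convention stated there (idx 0 for neutral, idx 1 for toxic). The helper names `softmax` and `decode` and the `0.5` threshold are illustrative assumptions, not part of the model or of the `transformers` API.

```python
import math
from typing import List, Sequence, Tuple

# Index order taken from the usage snippet: idx 0 = neutral, idx 1 = toxic
LABELS = ("neutral", "toxic")


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logit_rows: Sequence[Sequence[float]],
           threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Map each [neutral, toxic] score pair to (label, toxic probability).

    The threshold is a hypothetical default; tune it per language if needed.
    """
    results = []
    for row in logit_rows:
        p_toxic = softmax(row)[1]
        label = LABELS[1] if p_toxic >= threshold else LABELS[0]
        results.append((label, p_toxic))
    return results
```

With the `output` object from the usage snippet, this could be called as `decode(output.logits.tolist())`. Given how much the per-language F1 scores in the table vary, a language-specific threshold may be worth considering, but that is a suggestion rather than something evaluated here.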