---
datasets:
- artemkramov/coreference-dataset-ua
language:
- uk
tags:
- coreference-resolution
- anaphora
---
Coreference resolution model for the Ukrainian language
This coreference resolution model for the Ukrainian language was trained on the silver Ukrainian coreference dataset using the F-Coref model from the fastcoref library. It was fine-tuned from the XLM-RoBERTa-base model.
Model Details
Model Description
- Developed by: Artem Kramov, Andrii Kursin ([email protected]).
- Languages: Ukrainian
- Finetuned from model: XLM-RoBERTa-base
Model Sources
- Repository: https://github.com/artemkramov/fastcoref-ua/blob/main/README.md
- Demo: Google Colab
Out-of-Scope Use
According to the metrics computed on the evaluation dataset, the model favors precision over recall. It also produces fine-grained, nested mentions. For example, the mention "Головний виконавчий директор Андрій Сидоренко" can be split into the following coreferent mentions: ["Головний виконавчий директор Андрій Сидоренко", "Головний виконавчий директор", "Андрій Сидоренко"]. This behavior can also be used to extract positions, roles, or other attributes of entities in the text.
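The nesting in this example can be made explicit by checking which mention spans lie inside a longer span. The following is a minimal sketch in plain Python: the helper name `nested_parts` is hypothetical and not part of fastcoref, and the character offsets are computed with `str.find` for illustration rather than taken from model output.

```python
def nested_parts(spans):
    """Map each span to the other spans that lie fully inside it."""
    result = {}
    for outer in spans:
        inner = [
            s for s in spans
            if s != outer and outer[0] <= s[0] and s[1] <= outer[1]
        ]
        if inner:
            result[outer] = sorted(inner)
    return result


text = "Головний виконавчий директор Андрій Сидоренко"
mentions = [
    "Головний виконавчий директор Андрій Сидоренко",
    "Головний виконавчий директор",
    "Андрій Сидоренко",
]
# Hypothetical spans: (start, end) character offsets, end exclusive.
spans = [(text.find(m), text.find(m) + len(m)) for m in mentions]

# Print each enclosing mention together with the mentions nested in it.
for outer, inner in nested_parts(spans).items():
    print(text[outer[0]:outer[1]], "->", [text[a:b] for a, b in inner])
```

Here the full mention encloses both the role ("Головний виконавчий директор") and the name ("Андрій Сидоренко"), which is one way such nested mentions could be used to pair an entity with its position.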
How to Get Started with the Model
Use the code below to get started with the model.
from fastcoref import FCoref
import spacy
nlp = spacy.load('uk_core_news_md')
model_path = "artemkramov/coref-ua"
model = FCoref(model_name_or_path=model_path, device='cuda:0', nlp=nlp)
preds = model.predict(
    texts=["""Мій друг дав мені свою машину та ключі до неї; крім того, він дав мені його книгу. Я з радістю її читаю."""]
)
preds[0].get_clusters(as_strings=False)
> [[(0, 3), (13, 17), (66, 70), (83, 84)],
[(0, 8), (18, 22), (58, 61), (71, 75)],
[(18, 29), (42, 45)],
[(71, 81), (95, 97)]]
preds[0].get_clusters()
> [['Мій', 'мені', 'мені', 'Я'], ['Мій друг', 'свою', 'він', 'його'], ['свою машину', 'неї'], ['його книгу', 'її']]
preds[0].get_logit(
span_i=(13, 17), span_j=(42, 45)
)
> -6.867196
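The character offsets returned by `get_clusters(as_strings=False)` index directly into the input text, so each span can be mapped back to its surface string by slicing. The sketch below uses the offsets printed above as literal data, so it runs without loading the model; the helper name `spans_to_strings` is hypothetical.

```python
text = (
    "Мій друг дав мені свою машину та ключі до неї; "
    "крім того, він дав мені його книгу. Я з радістю її читаю."
)

# Clusters copied from the get_clusters(as_strings=False) output above.
clusters = [
    [(0, 3), (13, 17), (66, 70), (83, 84)],
    [(0, 8), (18, 22), (58, 61), (71, 75)],
    [(18, 29), (42, 45)],
    [(71, 81), (95, 97)],
]


def spans_to_strings(text, clusters):
    """Replace every (start, end) span with the substring it covers."""
    return [[text[start:end] for start, end in cluster] for cluster in clusters]


for cluster in spans_to_strings(text, clusters):
    print(cluster)
```

The result should match the string clusters returned by `get_clusters()` with its default arguments.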
Training Details
Training Data
The model was trained on the silver coreference resolution dataset: https://huggingface.co/datasets/artemkramov/coreference-dataset-ua.
Evaluation
Metrics
Two types of metrics were considered: mention-based metrics and coreference resolution metrics.
Mention-based metrics:
- mention precision
- mention recall
- mention F1
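As an illustration, mention-level scores can be computed by comparing predicted and gold mention spans. This is a minimal sketch assuming exact span matching; the function name `mention_prf` and the example spans are hypothetical, and the evaluation script used for this model may differ in detail.

```python
def mention_prf(predicted, gold):
    """Return (precision, recall, F1) for exact-match mention spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical spans as (start, end) character offsets.
gold = [(0, 3), (0, 8), (13, 17), (18, 22)]
predicted = [(0, 8), (13, 17), (23, 29)]

p, r, f1 = mention_prf(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```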
Coreference resolution metrics were calculated as the average of the corresponding values across three metrics: MUC, B-cubed (B³), and CEAF-e (entity-based CEAF):
- coreference precision
- coreference recall
- coreference F1
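The averaging can be sketched as follows. The per-metric precision and recall values are placeholders chosen for illustration, not results for this model, and the helper names `f1_score` and `average_scores` are hypothetical.

```python
from statistics import mean


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def average_scores(per_metric):
    """Average precision, recall, and F1 across the given metrics."""
    return {
        "precision": mean(p for p, _ in per_metric.values()),
        "recall": mean(r for _, r in per_metric.values()),
        "f1": mean(f1_score(p, r) for p, r in per_metric.values()),
    }


# Placeholder (precision, recall) pairs, for illustration only.
per_metric = {
    "MUC": (0.80, 0.70),
    "B3": (0.70, 0.60),
    "CEAFe": (0.60, 0.50),
}

averaged = average_scores(per_metric)
print({name: round(value, 3) for name, value in averaged.items()})
```

Note that in this sketch the averaged F1 is the mean of the three per-metric F1 values, which is not necessarily equal to the harmonic mean of the averaged precision and recall.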
Results
The metrics for the validation dataset:
| Metric | Value |
|---|---|
| Mention precision | 0.850 |
| Mention recall | 0.798 |
| Mention F1 | 0.824 |
| Coreference precision | 0.758 |
| Coreference recall | 0.706 |
| Coreference F1 | 0.731 |