|
--- |
|
pipeline_tag: translation |
|
language: |
|
- multilingual |
|
- af |
|
- am |
|
- ar |
|
- as |
|
- az |
|
- be |
|
- bg |
|
- bn |
|
- br |
|
- bs |
|
- ca |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- fy |
|
- ga |
|
- gd |
|
- gl |
|
- gu |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- hu |
|
- hy |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- jv |
|
- ka |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lo |
|
- lt |
|
- lv |
|
- mg |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- 'no' |
|
- om |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- sa |
|
- sd |
|
- si |
|
- sk |
|
- sl |
|
- so |
|
- sq |
|
- sr |
|
- su |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- tl |
|
- tr |
|
- ug |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- xh |
|
- yi |
|
- zh |
|
license: apache-2.0 |
|
base_model: |
|
- FacebookAI/xlm-roberta-large |
|
--- |
|
|
|
# PreCOMET-avg ([paper](https://arxiv.org/abs/2501.18251)) |
|
|
|
This is a source-only COMET model used for efficient selection of evaluation subsets. |

Specifically, the model predicts the expected human score based on the source segment alone. |

The lower the predicted score, the more useful the segment is for evaluation, because systems will struggle more to translate it. |

The model is not compatible with Unbabel's original COMET; to run it, install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET): |
|
```bash |
pip install git+https://github.com/zouharvi/PreCOMET.git |
``` |
|
|
|
You can then use it in Python: |
|
```python |
import precomet |

# download and load the PreCOMET-avg checkpoint |
model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-avg")) |

# predict the expected human score from the source segments alone |
model.predict([ |
    {"src": "This is an easy source sentence."}, |
    {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"} |
])["scores"] |
> [72.0051040649414, 71.98278045654297] |
``` |
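 |
Because lower predicted scores indicate harder sources, one natural use is to rank a pool of candidate segments and keep the lowest-scoring ones. Below is a minimal sketch using only the API shown above; the candidate sentences are invented for illustration: |

```python |
import precomet |

# hypothetical pool of candidate source segments (invented examples) |
pool = [ |
    "The cat sat on the mat.", |
    "Notwithstanding the aforementioned caveats, the committee's findings remain provisional.", |
    "Quarterly revenue grew 4.2% despite persistent supply-chain headwinds.", |
] |

model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-avg")) |
scores = model.predict([{"src": src} for src in pool])["scores"] |

# sort ascending: the lowest expected human scores mark the hardest, most informative sources |
for score, src in sorted(zip(scores, pool)): |
    print(f"{score:6.2f}  {src}") |
``` |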
|
|
|
The primary use of this model is through the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package: |
|
|
|
```python |
import subset2evaluate |

# load WMT23 English-Czech evaluation data |
data_full = subset2evaluate.utils.load_data("wmt23/en-cs") |

# baseline: pick evaluation segments at random |
data_random = subset2evaluate.select_subset.basic(data_full, method="random") |
subset2evaluate.evaluate.eval_subset_clusters(data_random[:100]) |
> 2 |
subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full) |
> 0.71 |
``` |
|
With a budget of only 100 segments, random selection gives us two system clusters and a system-level Spearman correlation of 0.71. Using this model instead: |
|
```python |
# rank segments with PreCOMET-avg instead of randomly |
data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_avg") |
subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100]) |
> 2 |
subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full) |
> 0.61 |
``` |
|
we evaluate on the segments predicted to be hardest. |
Note that this is not the strongest PreCOMET model; the effect in this small example is modest, and you can expect a bigger effect at a larger scale, as described in the paper. |
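 |
To see how the effect develops with the budget, the same comparison can be run at several subset sizes using only the functions shown above; a rough sketch (the budget values are arbitrary): |

```python |
import subset2evaluate |

data_full = subset2evaluate.utils.load_data("wmt23/en-cs") |

# compare random selection against PreCOMET-avg at a few segment budgets |
for method in ["random", "precomet_avg"]: |
    data_sorted = subset2evaluate.select_subset.basic(data_full, method=method) |
    for budget in [100, 200, 400]: |
        subset = data_sorted[:budget] |
        clusters = subset2evaluate.evaluate.eval_subset_clusters(subset) |
        correlation = subset2evaluate.evaluate.eval_subset_correlation(subset, data_full) |
        print(f"{method:>12}  budget={budget:>4}  clusters={clusters}  correlation={correlation:.2f}") |
``` |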
|
|
|
|
|
This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251). |
|
Cite as: |
|
``` |
@misc{zouhar2025selectdatapointsefficienthuman, |
  title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?}, |
  author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan}, |
  year={2025}, |
  eprint={2501.18251}, |
  archivePrefix={arXiv}, |
  primaryClass={cs.CL}, |
  url={https://arxiv.org/abs/2501.18251}, |
} |
``` |