---
pipeline_tag: translation
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - zh
license: apache-2.0
base_model:
  - FacebookAI/xlm-roberta-large
---

# PreCOMET-var ([Paper](https://arxiv.org/abs/2501.18251))

This is a source-only COMET model used for efficient evaluation subset selection. Specifically, the model predicts the expected variance in human scores of translations of a given source segment. It is trained on direct assessment scores from WMT up to 2022. The higher the predicted score, the more useful the segment is for evaluation, because it is more likely to distinguish between systems. The model is not compatible with the original Unbabel COMET; to run it, install [PreCOMET](https://github.com/zouharvi/PreCOMET):

```bash
pip install git+https://github.com/zouharvi/PreCOMET.git
```

You can then use it in Python:

```python
import precomet

model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-var"))
model.predict([
    {"src": "This is an easy source sentence."},
    {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"}
])["scores"]
> [70.99381256103516, 70.99385833740234]
```
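Because higher predicted variance marks segments that are more likely to separate systems, you can also rank a candidate pool by these scores yourself. A minimal sketch (the `sources` list and the budget `k` are made up for illustration):

```python
import precomet

# Load the model once; download_model fetches the checkpoint from the Hugging Face hub.
model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-var"))

# Hypothetical pool of candidate source segments.
sources = [
    "This is an easy source sentence.",
    "A long, domain-specific sentence with rare terminology.",
    "Short one.",
]
scores = model.predict([{"src": src} for src in sources])["scores"]

# Keep the k segments with the highest predicted variance for human evaluation.
k = 2
for score, src in sorted(zip(scores, sources), reverse=True)[:k]:
    print(f"{score:.2f}\t{src}")
```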

The primary use of this model is through the subset2evaluate package:

```python
import subset2evaluate

data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
data_random = subset2evaluate.select_subset.basic(data_full, method="random")
subset2evaluate.evaluate.eval_subset_clusters(data_random[:100])
> 1
subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full)
> 0.71
```

Random selection yields only one cluster and a system-level Spearman correlation of 0.71 when the budget is only 100 segments. However, by using this model:

```python
data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_var")
subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100])
> 2
subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full)
> 0.92
```

we get a higher correlation and more clusters. You can expect a bigger effect at a larger scale, as described in the paper.
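To see how the gap between random and PreCOMET-var selection evolves with the budget, you can sweep over subset sizes using the same calls as above. A rough sketch under the same setup (the budget values are arbitrary):

```python
import subset2evaluate

data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
data_random = subset2evaluate.select_subset.basic(data_full, method="random")
data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_var")

# Compare system-level correlation at several example budgets.
for budget in [100, 200, 400]:
    corr_random = subset2evaluate.evaluate.eval_subset_correlation(data_random[:budget], data_full)
    corr_precomet = subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:budget], data_full)
    print(budget, corr_random, corr_precomet)
```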

This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251). Cite as:

```bibtex
@misc{zouhar2025selectdatapointsefficienthuman,
    title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
    author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
    year={2025},
    eprint={2501.18251},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2501.18251},
}
```