	---
	pipeline_tag: translation
	language:
	- multilingual
	- af
	- am
	- ar
	- as
	- az
	- be
	- bg
	- bn
	- br
	- bs
	- ca
	- cs
	- cy
	- da
	- de
	- el
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- fi
	- fr
	- fy
	- ga
	- gd
	- gl
	- gu
	- ha
	- he
	- hi
	- hr
	- hu
	- hy
	- id
	- is
	- it
	- ja
	- jv
	- ka
	- kk
	- km
	- kn
	- ko
	- ku
	- ky
	- la
	- lo
	- lt
	- lv
	- mg
	- mk
	- ml
	- mn
	- mr
	- ms
	- my
	- ne
	- nl
	- 'no'
	- om
	- or
	- pa
	- pl
	- ps
	- pt
	- ro
	- ru
	- sa
	- sd
	- si
	- sk
	- sl
	- so
	- sq
	- sr
	- su
	- sv
	- sw
	- ta
	- te
	- th
	- tl
	- tr
	- ug
	- uk
	- ur
	- uz
	- vi
	- xh
	- yi
	- zh
	license: apache-2.0
	base_model:
	- FacebookAI/xlm-roberta-large
	---

	# PreCOMET-avg [![Paper](https://img.shields.io/badge/📜%20paper-481.svg)](https://arxiv.org/abs/2501.18251)

	This is a source-only COMET model used for efficient selection of evaluation subsets.
	Specifically, it predicts the expected human score from the source text alone.
	Lower predicted scores mark sources that are better suited for evaluation, because models are expected to struggle more when translating them.
	It is not compatible with the original Unbabel COMET; to run it, install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET):
	```bash
	pip3 install git+https://github.com/zouharvi/PreCOMET.git
	```

	You can then use it in Python:
	```python
	import precomet
	model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-avg"))
	# Each input needs only the source text under the "src" key.
	model.predict([
	    {"src": "This is an easy source sentence."},
	    {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"}
	])["scores"]
	# > [72.0051040649414, 71.98278045654297]
	```
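
Since lower predicted scores mark sources that are more useful for evaluation, selection under a fixed budget can be sketched as sorting by score and keeping the lowest-scoring items. The helper `select_lowest_scoring` and the example scores below are hypothetical and only illustrate the idea; this is not necessarily how subset2evaluate implements its selection.

```python
def select_lowest_scoring(items, scores, budget):
    """Return the `budget` items with the lowest predicted scores."""
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    # Sort indices by ascending score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(items)), key=lambda i: scores[i])
    return [items[i] for i in order[:budget]]


# Hypothetical sources and made-up scores, for illustration only.
# In practice the scores would come from model.predict(...)["scores"].
sources = [{"src": "A"}, {"src": "B"}, {"src": "C"}, {"src": "D"}]
predicted = [72.0, 65.5, 80.1, 70.3]

subset = select_lowest_scoring(sources, predicted, budget=2)
print([item["src"] for item in subset])  # ['B', 'D']
```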

	The primary use of this model is from the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package:

	```python
	import subset2evaluate

	# Load the full test set and order it randomly as a baseline.
	data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
	data_random = subset2evaluate.select_subset.basic(data_full, method="random")
	# Evaluate the first 100 segments of the random ordering.
	subset2evaluate.evaluate.eval_subset_clusters(data_random[:100])
	# > 1
	subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full)
	# > 0.71
	```
	With a budget of only 100 segments, random selection yields just one cluster and a system-level Spearman correlation of 0.71. However, by using this model:
	```python
	# Order the data with this model and evaluate the first 100 segments.
	data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_avg")
	subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100])
	# > 2
	subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full)
	# > 0.61
	```
	we get more clusters (2 instead of 1), although the correlation drops to 0.61 in this small example.
	Note that this is not the best PreCOMET model and you can expect a bigger effect on a larger scale, as described in the paper.
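
For intuition about the correlation reported above, the following is a minimal sketch of a system-level Spearman correlation between a subset and the full data: average each system's scores, rank the systems, and correlate the two rankings. It assumes a simplified, hypothetical data layout in which each item stores one score per system under `"scores"`; the actual subset2evaluate data format and implementation may differ.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def system_averages(data, systems):
    """Average score of each system over the given items."""
    return [sum(item["scores"][s] for item in data) / len(data) for s in systems]


# Hypothetical toy data: four segments scored for three systems.
systems = ["sysA", "sysB", "sysC"]
full = [
    {"scores": {"sysA": 0.9, "sysB": 0.7, "sysC": 0.5}},
    {"scores": {"sysA": 0.8, "sysB": 0.9, "sysC": 0.4}},
    {"scores": {"sysA": 0.7, "sysB": 0.6, "sysC": 0.6}},
    {"scores": {"sysA": 0.9, "sysB": 0.5, "sysC": 0.3}},
]

full_avg = system_averages(full, systems)
rho_two = spearman(system_averages(full[:2], systems), full_avg)
rho_one = spearman(system_averages(full[1:2], systems), full_avg)
print(rho_two)  # 1.0: the first two segments rank the systems like the full data
print(rho_one)  # 0.5: the second segment alone swaps sysA and sysB
```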


	This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251).
	Cite as:
	```bibtex
	@misc{zouhar2025selectdatapointsefficienthuman,
	    title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
	    author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
	    year={2025},
	    eprint={2501.18251},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL},
	    url={https://arxiv.org/abs/2501.18251},
	}
	```