general-preference
/

GPO-Llama-3-8B-Instruct-GPM-2B

Text Generation

Model card Files Files and versions

GPO-Llama-3-8B-Instruct-GPM-2B / README.md

yifAI's picture

Update README.md

f6fae48 verified 11 months ago

|

history blame contribute delete

3.23 kB

	---
	language:
	- en
	license: apache-2.0
	datasets:
	- openbmb/UltraFeedback
	pipeline_tag: text-generation
	model-index:
	- name: GPO-Llama-3-8B-Instruct-GPM-2B
	results: []
	---

	General Preference Modeling with Preference Representations for Aligning Language Models (https://arxiv.org/abs/2410.02197)

	# GPO-Llama-3-8B-Instruct-GPM-2B

	This model was developed using [General Preference Optimization (GPO)](https://arxiv.org/abs/2405.00675) at iteration 3 and the [General Preference representation Model (GPM)](https://arxiv.org/abs/2410.02197) (specifically, using [GPM-Gemma-2B](https://huggingface.co/general-preference/GPM-Gemma-2B)), based on the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) architecture as starting point. We utilized the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, splited to 3 parts for 3 iterations by [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.


	## Links to Other Models
	- [SPPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/SPPO-Llama-3-8B-Instruct-GPM-2B)
	- [GPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/GPO-Llama-3-8B-Instruct-GPM-2B)

	### Model Description

	- Model type: A 8B parameter GPT-like model fine-tuned on synthetic datasets.
	- Language(s) (NLP): Primarily English
	- License: Apache-2.0
	- Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct


	## [AlpacaEval Leaderboard Evaluation Results](https://tatsu-lab.github.io/alpaca_eval/)


	\| Model \| LC. Win Rate \| Win Rate \| Avg. Length \|
	\|-------------------------------------------\|:------------:\|:--------:\|:-----------:\|
	\|[GPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/GPO-Llama-3-8B-Instruct-GPM-2B) \| 38.43 \| 48.87 \| 2613



	## [Open LLM Leaderboard Evaluation Results](https://github.com/EleutherAI/lm-evaluation-harness)

	Results are reported by using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.1

	\| \| arc_challenge \| truthfulqa_mc2 \| winogrande \| gsm8k \| hellaswag \| mmlu \| average \|
	\|--------\|---------------\|----------------\|------------\|-------\|-----------\|-------\|---------\|
	\|[GPO-Llama-3-8B-Instruct-GPM-2B](https://huggingface.co/general-preference/GPO-Llama-3-8B-Instruct-GPM-2B) \| 61.43 \| 53.54 \| 75.22 \| 76.12 \| 78.06 \| 65.65 \| 68.34


	### Training hyperparameters
	The following hyperparameters were used during training:

	- learning_rate: 5e-07
	- beta: 0.001
	- per_device_train_batch_size: 8
	- gradient_accumulation_steps: 1
	- seed: 42
	- distributed_type: deepspeed_zero3
	- num_devices: 8
	- optimizer: RMSProp
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_train_epochs: 6.0 (stop at epoch=1.0)




	## Citation
	```
	@article{zhang2024general,
	title={General Preference Modeling with Preference Representations for Aligning Language Models},
	author={Zhang, Yifan and Zhang, Ge and Wu, Yue and Xu, Kangping and Gu, Quanquan},
	journal={arXiv preprint arXiv:2410.02197},
	year={2024}
	}
	```