# llama-3-8b-dpo-ultrafeedback-5e-7-SFTed-paged_adamw_32bit-increase_linear-0.95to1.0
This model is released with the preprint *DPO-Shift: Shifting the Distribution of Direct Preference Optimization*. Please refer to our repository for more details.
This model is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:
- Loss: 0.5521
- Rewards/chosen: -0.4441
- Rewards/rejected: -0.9724
- Dpo Lambda: 0.9972
- Rewards/accuracies: 0.7330
- Rewards/margins: 0.5282
- Logps/rejected: -368.2690
- Logps/chosen: -345.0610
- Logits/rejected: -1.0526
- Logits/chosen: -1.0164
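
The snippet below is a minimal sketch of loading the released checkpoint for generation with `transformers`. The prompt and generation settings are illustrative only, and the bf16 dtype and `device_map="auto"` assume a single GPU with enough memory for an 8B model.

```python
# Minimal usage sketch; prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NoManDeRY/DPO-Shift-Llama-3-8B-Ultrafeedback-increase_linear_0.95to1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",
)

prompt = "Explain the difference between supervised fine-tuning and DPO in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```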
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
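
As a rough guide, these hyperparameters map onto a TRL `DPOConfig` as sketched below. This is only a sketch assuming TRL's stock `DPOTrainer`; the actual DPO-Shift run uses the modified trainer from our repository (with the λ-shifted loss and its schedule), and the output directory, bf16 setting, and DPO beta are assumptions not listed in this card.

```python
# Hyperparameter-mapping sketch, assuming TRL's stock DPOTrainer/DPOConfig.
# The real DPO-Shift run uses the modified trainer from the paper's repository;
# the lambda schedule is not part of stock TRL and is omitted here.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "princeton-nlp/Llama-3-Base-8B-SFT"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-ultrafeedback",  # illustrative
    learning_rate=5e-7,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=32,  # 2 GPUs x 2 per device x 32 = 128 effective
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="paged_adamw_32bit",
    bf16=True,  # assumption; precision is not stated in the card
    # beta is not listed in the card, so the TRL default would apply here
)

trainer = DPOTrainer(
    model=model,            # ref_model defaults to None; TRL creates a frozen copy
    args=config,
    train_dataset=dataset["train_prefs"],
    eval_dataset=dataset["test_prefs"],
    tokenizer=tokenizer,    # newer TRL versions use processing_class=tokenizer
)
trainer.train()
```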
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Dpo Lambda | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:----------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6826 | 0.1047 | 50  | 0.6800 | 0.1038  | 0.0794  | 0.9552 | 0.6390 | 0.0244 | -263.0930 | -290.2701 | -0.9133 | -0.8451 |
| 0.6082 | 0.2094 | 100 | 0.6337 | -0.0205 | -0.1780 | 0.9605 | 0.7020 | 0.1574 | -288.8296 | -302.7013 | -0.9937 | -0.9367 |
| 0.6284 | 0.3141 | 150 | 0.6010 | -0.1351 | -0.4143 | 0.9657 | 0.7100 | 0.2792 | -312.4618 | -314.1588 | -0.9587 | -0.9129 |
| 0.6285 | 0.4187 | 200 | 0.5842 | -0.2139 | -0.5788 | 0.9710 | 0.7190 | 0.3649 | -328.9060 | -322.0359 | -0.9686 | -0.9279 |
| 0.5768 | 0.5234 | 250 | 0.5719 | -0.3843 | -0.8383 | 0.9762 | 0.7170 | 0.4540 | -354.8630 | -339.0806 | -1.0255 | -0.9855 |
| 0.5425 | 0.6281 | 300 | 0.5668 | -0.3998 | -0.8767 | 0.9814 | 0.7220 | 0.4769 | -358.6997 | -340.6255 | -1.0275 | -0.9893 |
| 0.573  | 0.7328 | 350 | 0.5578 | -0.4303 | -0.9403 | 0.9867 | 0.7280 | 0.5101 | -365.0644 | -343.6735 | -1.0352 | -0.9987 |
| 0.5364 | 0.8375 | 400 | 0.5543 | -0.4206 | -0.9426 | 0.9919 | 0.7320 | 0.5220 | -365.2877 | -342.7060 | -1.0446 | -1.0087 |
| 0.5385 | 0.9422 | 450 | 0.5521 | -0.4441 | -0.9724 | 0.9972 | 0.7330 | 0.5282 | -368.2690 | -345.0610 | -1.0526 | -1.0164 |
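
The "Dpo Lambda" column follows the linear schedule named in the model id ("increase_linear 0.95 to 1.0"): λ grows linearly with training progress from 0.95 to 1.0 over the single epoch. A minimal sketch of that schedule (the function name is ours, not from the training code):

```python
# Linear lambda schedule sketch: lambda rises from 0.95 to 1.0 over training.
# Reproduces the logged values within rounding, e.g. dpo_lambda(0.1047) ~= 0.9552.
def dpo_lambda(progress: float, lam_start: float = 0.95, lam_end: float = 1.0) -> float:
    """progress is the fraction of training completed, in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * progress

for epoch_fraction in (0.1047, 0.5234, 0.9422):
    print(f"epoch {epoch_fraction:.4f} -> lambda {dpo_lambda(epoch_fraction):.4f}")
```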
### Framework versions
- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1