araft_trained_dpo
This model was obtained by fine-tuning araft_trained_sft with DPO. Trajectories from the Araft dataset were used to adapt the model to issue a novel query at every step, instead of repeating the query from the previous step.
Model description
This model was produced as part of the Araft project. The Araft project consists of fine-tuning a Llama2-7B model to enable the use of the ReAct pattern for Wikipedia-augmented question answering. This model is the product of the second and final training step: DPO training.
In the DPO training step, trajectories from the Araft dataset were used to fine-tune the model. For each step, the genuine next step served as the preferred (chosen) output for the preceding part of the trajectory, while a repetition of the previous step served as the rejected output. The model achieves an F1 score of 26% on the HotpotQA dataset.
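As a hedged illustration (not the Araft repo's actual code), DPO preference pairs could be built from a ReAct trajectory as sketched below, with the genuine next step as the chosen completion and a repetition of the previous step as the rejected one; the helper name, field names, and example trajectory are assumptions:

```python
def build_preference_pairs(trajectory_steps):
    """Hypothetical sketch: for each step, the prompt is the trajectory so far,
    the chosen completion is the genuine next step, and the rejected completion
    is a repeat of the previous step (the failure mode DPO should discourage)."""
    pairs = []
    for i in range(1, len(trajectory_steps)):
        pairs.append({
            "prompt": "\n".join(trajectory_steps[:i]),
            "chosen": trajectory_steps[i],        # novel next step
            "rejected": trajectory_steps[i - 1],  # repetition of the previous step
        })
    return pairs

# Illustrative ReAct-style trajectory (not taken from the Araft dataset)
steps = [
    "Thought: I need to find who wrote the novel.\nAction: Search[novel title]",
    "Observation: The novel was written by Author X.",
    "Thought: Now I can answer.\nAction: Finish[Author X]",
]
pairs = build_preference_pairs(steps)
```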
For further information, please see the Araft GitHub repo.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 1
- mixed_precision_training: Native AMP
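For illustration, these hyperparameters map roughly onto the following Hugging Face TrainingArguments. This is a hedged sketch, not the exact training script; the output directory and any omitted options are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="araft_trained_dpo",   # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,    # effective train batch size: 1 * 4 = 4
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=1,
    fp16=True,                        # mixed precision (Native AMP)
    optim="adamw_torch",              # Adam with betas=(0.9, 0.999), eps=1e-8 (defaults)
)
```

In a setup like this, the arguments would typically be passed, together with the preference dataset and a PEFT configuration, to trl's DPOTrainer.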
Framework versions
- PEFT 0.10.0
- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
Base model
- meta-llama/Llama-2-7b-chat-hf