araft_trained_dpo
This model was obtained by fine-tuning araft_trained_sft with DPO. Trajectories from the Araft dataset were used to adapt the model to issue a novel query at every step, instead of repeating the query from the previous step.
Model description
This model was produced as part of the Araft project. The Araft project consists of fine-tuning a Llama2-7B model to enable the use of the ReAct pattern for Wikipedia-augmented question answering. This model is the product of the second and final training step: DPO training.
In the DPO training step, trajectories from the Araft dataset were used to fine-tune the model. For each step, the genuine next step served as the preferred (chosen) output for the preceding part of the trajectory, while a repetition of the previous step served as the rejected output. The model achieves an F1 score of 26% on the HotpotQA dataset.
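As a hedged illustration (not the Araft repo's actual code), DPO preference pairs could be built from a ReAct trajectory as sketched below, with the genuine next step as the chosen completion and a repetition of the previous step as the rejected one; the helper name, field names, and example trajectory are assumptions:

```python
def build_preference_pairs(trajectory_steps):
    """Hypothetical sketch: for each step, the prompt is the trajectory so far,
    the chosen completion is the genuine next step, and the rejected completion
    is a repeat of the previous step (the failure mode DPO should discourage)."""
    pairs = []
    for i in range(1, len(trajectory_steps)):
        pairs.append({
            "prompt": "\n".join(trajectory_steps[:i]),
            "chosen": trajectory_steps[i],        # novel next step
            "rejected": trajectory_steps[i - 1],  # repetition of the previous step
        })
    return pairs

# Illustrative ReAct-style trajectory (not taken from the Araft dataset)
steps = [
    "Thought: I need to find who wrote the novel.\nAction: Search[novel title]",
    "Observation: The novel was written by Author X.",
    "Thought: Now I can answer.\nAction: Finish[Author X]",
]
pairs = build_preference_pairs(steps)
```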
For further information, please see the Araft GitHub repo.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 1
- mixed_precision_training: Native AMP
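For illustration, these hyperparameters map roughly onto the following Hugging Face TrainingArguments. This is a hedged sketch, not the exact training script; the output directory and any omitted options are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="araft_trained_dpo",   # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,    # effective train batch size: 1 * 4 = 4
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=1,
    fp16=True,                        # mixed precision (Native AMP)
    optim="adamw_torch",              # Adam with betas=(0.9, 0.999), eps=1e-8 (defaults)
)
```

In a setup like this, the arguments would typically be passed, together with the preference dataset and a PEFT configuration, to trl's DPOTrainer.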
Framework versions
- PEFT 0.10.0
- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
Base model
- meta-llama/Llama-2-7b-chat-hf