---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- llama
- dpo
- preference-optimization
- PEFT
- instruction-tuning
pipeline_tag: text-generation
---
|
# DPO Fine-Tuned Adapter - LLM Judge Dataset
|
|
|
## 🧠 Model

- Base: `meta-llama/Llama-3.2-1B-Instruct`
- Fine-tuned with TRL's `DPOTrainer` on the LLM Judge preference dataset (50 preference pairs)
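
Below is a minimal sketch of how the base model, tokenizer, and a PEFT LoRA configuration can be prepared for DPO training. The LoRA hyperparameters (`r`, `lora_alpha`, dropout, target modules) are illustrative assumptions and are not recorded on this card; only the base model ID and the EOS padding token come from the card itself.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Load tokenizer and set the pad token to EOS, as listed in the training parameters below.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model to be adapted.
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Illustrative LoRA config -- r / alpha / dropout / target modules are assumptions, not card values.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```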
|
|
|
## ⚙️ Training Parameters
|
| Parameter               | Value        |
|-------------------------|--------------|
| Learning Rate           | 5e-5         |
| Batch Size              | 4            |
| Epochs                  | 3            |
| Beta (DPO regularizer)  | 0.1          |
| Max Input Length        | 1024 tokens  |
| Max Prompt Length       | 512 tokens   |
| Padding Token           | `eos_token`  |
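
A sketch of how these parameters map onto TRL's `DPOConfig` / `DPOTrainer`, assuming a recent TRL release where `beta` and the length limits are set on `DPOConfig`. The `model`, `tokenizer`, `peft_config`, and `train_dataset` objects are the ones built in the sketches in the neighbouring sections; `output_dir` is an illustrative name.

```python
from trl import DPOConfig, DPOTrainer

# Mirror the hyperparameters from the table above.
training_args = DPOConfig(
    output_dir="dpo-llmjudge-lora-adapter",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    beta=0.1,                 # DPO regularizer
    max_length=1024,          # max input length (prompt + completion)
    max_prompt_length=512,
)

trainer = DPOTrainer(
    model=model,                  # base model from the Model section sketch
    ref_model=None,               # with a PEFT adapter, the frozen base acts as the reference
    args=training_args,
    train_dataset=train_dataset,  # preference pairs from the Dataset section sketch
    processing_class=tokenizer,   # `tokenizer=` on older TRL versions
    peft_config=peft_config,
)
trainer.train()
```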
|
|
|
## 📦 Dataset

- Source: `llm_judge_preferences.csv`
- Size: 50 human-labeled pairs with `prompt`, `chosen`, and `rejected` columns
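
A minimal sketch of loading the CSV into a 🤗 `datasets` object with the `prompt` / `chosen` / `rejected` columns that `DPOTrainer` expects; the file path is assumed to be local to the training environment.

```python
from datasets import load_dataset

# Load the 50 preference pairs; each row has `prompt`, `chosen`, and `rejected` columns.
train_dataset = load_dataset(
    "csv",
    data_files="llm_judge_preferences.csv",
    split="train",
)
print(train_dataset.column_names)  # expected: ['prompt', 'chosen', 'rejected']
```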
|
|
|
## 📂 Output

- The LoRA adapter is saved and uploaded to the Hugging Face Hub as `Likhith003/dpo-llmjudge-lora-adapter`
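
A minimal usage sketch for loading the adapter on top of the base model for inference. The prompt and generation settings are illustrative, not part of the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "Likhith003/dpo-llmjudge-lora-adapter"

# Load the base model, then attach the DPO-trained LoRA adapter from the Hub.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, adapter_id)

prompt = "Explain the difference between supervised fine-tuning and DPO."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```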
|