---
license: cc-by-nc-sa-4.0
datasets:
- PengxiangLi/SPORT
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
---
 

# 🎯 SPORT: Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2504.21561-b31b1b.svg)](https://arxiv.org/abs/2504.21561)
[![Project Page](https://img.shields.io/badge/Project-Page-2ea44f)](https://sport-agents.github.io)
[![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/pdf/2504.21561)

</div>

This repository contains the **LoRA checkpoint** for **SPORT**, a framework that enables multimodal agents to improve iteratively through self-generated tasks and preference-based optimization.
We finetuned **Qwen2-VL-7B-Instruct** using **LoRA adapters** and **Direct Preference Optimization (DPO)**, making the model more effective at reasoning about multimodal tasks and aligning with preference signals.
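
For reference, DPO fits the policy $\pi_\theta$ to such preference pairs with the standard objective (Rafailov et al., 2023), where $y_w$ and $y_l$ denote the preferred and rejected candidates for a context $x$, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ controls the implicit KL constraint; in SPORT the pairs are collected at the step level:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$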

---

## 📋 Key Features

* **LoRA Fine-tuning**: Lightweight finetuning on top of Qwen2-VL-7B-Instruct for efficient adaptation.
* **DPO Training**: Preference-based optimization for stronger alignment without human annotations.
* **Task Synthesis**: Multimodal task generation via LLMs for broad coverage.
* **Step Exploration**: Multiple candidate actions sampled per decision point.
* **Step Verification**: LLM-based critics evaluate and rank candidate outcomes.
* **Self-Improvement Loop**: Iterative cycle of task creation, exploration, and refinement (sketched below).
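
The sketch below shows how these pieces fit together in one iteration. It is pseudocode only: every helper (`synthesize_tasks`, `policy.sample`, `critic.rank`, `dpo_update`) is an illustrative placeholder, not part of the released code.

```python
# Pseudocode sketch of one SPORT self-improvement iteration.
# All helpers here are illustrative placeholders, not the released API.

def sport_iteration(policy, critic, num_candidates=4):
    preference_pairs = []
    for task in synthesize_tasks():                # task synthesis via LLMs
        state = task.initial_state()
        while not state.done:
            # Step exploration: sample several candidate actions per step.
            candidates = [policy.sample(state) for _ in range(num_candidates)]
            # Step verification: an LLM critic ranks the candidate outcomes.
            ranked = critic.rank(state, candidates)
            # Keep (chosen, rejected) pairs for step-wise preference tuning.
            preference_pairs.append((state, ranked[0], ranked[-1]))
            state = state.apply(ranked[0])         # continue from the best step
    # Refinement: update the policy with DPO on the step-level pairs.
    return dpo_update(policy, preference_pairs)
```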

---

## 🚀 Performance Highlights

On the **GTA benchmark**, SPORT demonstrates consistent improvements over strong baselines:

* **+7%** Answer Accuracy (AnsAcc)
* **+8%** Tool Accuracy (ToolAcc)
* **+7%** Code Execution Success (CodeExec)

---

## 💾 Model Details

* **Base Model**: [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
* **Finetuning Method**: LoRA (rank 64, α=16; see the configuration sketch below)
* **Optimization**: Direct Preference Optimization (DPO)
* **Checkpoint**: LoRA weights only (requires merging with base model for inference)
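
As a rough illustration, the settings above correspond to a PEFT configuration along these lines; note that `target_modules` and `lora_dropout` are assumptions, not values stated on this card:

```python
from peft import LoraConfig

# r and lora_alpha follow the card above; target_modules and lora_dropout
# are assumed values and may differ from the released adapter's config.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```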

---

## 🛠️ Usage

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

base_model = "Qwen/Qwen2-VL-7B-Instruct"
lora_ckpt = "your-hf-username/SPORT-LoRA-7B"

# Qwen2-VL is a vision-language model, so load it with its dedicated class
# and processor rather than AutoModelForCausalLM/AutoTokenizer.
processor = AutoProcessor.from_pretrained(base_model)
model = Qwen2VLForConditionalGeneration.from_pretrained(base_model, device_map="auto")

# Attach the SPORT LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(model, lora_ckpt)
```
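
Since the checkpoint ships adapter weights only, you can optionally merge them into the base model for standalone deployment; `merge_and_unload()` is the standard PEFT call for this, and the output path below is just an example:

```python
# Fold the LoRA weights into the base model and drop the adapter wrappers.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("SPORT-7B-merged")   # example output path
processor.save_pretrained("SPORT-7B-merged")
```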

---

## 📝 Citation

If you use SPORT or this checkpoint in your research, please cite:

```bibtex
@misc{li2025iterative,
  title={Iterative Trajectory Exploration for Multimodal Agents}, 
  author={Li, Pengxiang and Gao, Zhi and Zhang, Bofei and Mi, Yapeng and Ma, Xiaojian and Shi, Chenrui and Yuan, Tao and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing},
  year={2025},
  eprint={2504.21561},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2504.21561}, 
}
```

---

⚠️ **Note**: This repository only provides LoRA weights. You must load them on top of the base **Qwen2-VL-7B-Instruct** model for inference.