Improve model card: Add prominent links and evaluation instructions (#1)
- Improve model card: Add prominent links and evaluation instructions (25e00aaabe2d6bc56b439581892da2f4aeb43810)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
````diff
@@ -3,18 +3,20 @@ base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
 license: apache-2.0
 tags:
 - gui
 - agent
 - gui-grounding
 - reinforcement-learning
-pipeline_tag: image-text-to-text
-library_name: transformers
 ---
 
 # InfiGUI-G1-7B
 
 This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.
 
 The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a novel reinforcement learning method designed to enhance the model's **semantic alignment** for GUI grounding tasks. It overcomes the exploration bottlenecks of standard RLVR methods by integrating a multi-answer generation strategy with a theoretically-grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
@@ -91,7 +93,7 @@ def visualize_points(original_image: Image.Image, points: list,
     # Draw circle
     circle_radius = 20
     draw.ellipse([original_x - circle_radius, original_y - circle_radius,
-                  original_x + circle_radius, original_y + circle_radius]
                  fill=(255, 0, 0))
 
     # Draw label
@@ -125,7 +127,8 @@ def main():
 
     # Prepare model inputs
     instruction = "shuffle play the current playlist"
-    system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer
     prompt = f'''The screen's resolution is {new_width}x{new_height}.
 Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''
 
@@ -162,10 +165,6 @@ if __name__ == "__main__":
     main()
 ```
 
-To reproduce the results in our paper, please refer to our repo for detailed instructions.
-
-For more details on the methodology and evaluation, please refer to our [paper](https://arxiv.org/abs/2508.05731) and [repository](https://github.com/InfiXAI/InfiGUI-G1).
-
 ## Results
 
 Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.
@@ -210,7 +209,92 @@ On the widely-used ScreenSpot-V2 benchmark, which provides comprehensive coverag
 <img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
 </div>
 
-##
 
 If you find this work useful, we would be grateful if you consider citing the following papers:
 
@@ -243,3 +327,7 @@ If you find this work useful, we would be grateful if you consider citing the fo
 year={2025}
 }
 ```
````
````diff
 - Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
+library_name: transformers
 license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
 - gui
 - agent
 - gui-grounding
 - reinforcement-learning
 ---
 
 # InfiGUI-G1-7B
 
+**[📚 Paper](https://arxiv.org/abs/2508.05731)** | **[🌐 Project Page](https://osatlas.github.io/)** | **[💻 Code](https://github.com/InfiXAI/InfiGUI-G1)**
+
 This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.
 
 The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a novel reinforcement learning method designed to enhance the model's **semantic alignment** for GUI grounding tasks. It overcomes the exploration bottlenecks of standard RLVR methods by integrating a multi-answer generation strategy with a theoretically-grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
````
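Purely as an illustration of the multi-answer idea described above, and not the actual AEPO objective from the paper, here is a minimal sketch of how a reward over several sampled candidate points might be scored; the candidate format, the ground-truth box, and the 1/k scaling are all assumptions of this example.

```python
# Illustrative only: NOT the AEPO reward function from the paper.
# Assumes each rollout proposes several candidate points for one instruction,
# and rewards an early correct hit while discouraging needless extra answers.
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def inside(point: Point, box: Box) -> bool:
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def multi_answer_reward(candidates: List[Point], gt_box: Box) -> float:
    """Toy adaptive reward: 1/k if the k-th candidate (in proposal order) is the
    first one to land inside the ground-truth box, 0 if none of them do."""
    for k, point in enumerate(candidates, start=1):
        if inside(point, gt_box):
            return 1.0 / k
    return 0.0

# Example: the second of three candidates hits the target element.
print(multi_answer_reward([(10, 10), (204, 118), (400, 60)], (190, 100, 260, 140)))  # 0.5
```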
````diff
     # Draw circle
     circle_radius = 20
     draw.ellipse([original_x - circle_radius, original_y - circle_radius,
+                  original_x + circle_radius, original_y + circle_radius],
                  fill=(255, 0, 0))
 
     # Draw label
````
````diff
 
     # Prepare model inputs
     instruction = "shuffle play the current playlist"
+    system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer. \
+The reasoning process MUST BE enclosed within <think> </think> tags.'
     prompt = f'''The screen's resolution is {new_width}x{new_height}.
 Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''
 
````
````diff
     main()
 ```
 
````
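The prompt above asks the model to reply, after its `<think>` block, with a JSON list of `point_2d` coordinates. As a small companion sketch (not part of the card's own example script), this is one way such a response string could be parsed back into points; the demo response text and the helper name are made up for illustration.

```python
# Hypothetical helper for parsing a response formatted as requested by the prompt above.
import json
import re

def extract_points(response: str) -> list[tuple[float, float]]:
    """Strip an optional <think>...</think> block, then parse the JSON answer
    of the form [{"point_2d": [x, y]}, ...] into (x, y) tuples."""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    match = re.search(r"\[.*\]", answer, flags=re.DOTALL)  # tolerate extra text around the JSON
    if match is None:
        return []
    return [tuple(item["point_2d"]) for item in json.loads(match.group(0))]

# Example with a made-up response string:
demo = '<think>The shuffle button is near the player bar.</think>[{"point_2d": [512, 1830]}]'
print(extract_points(demo))  # [(512, 1830)]
```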
````diff
 ## Results
 
 Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.
````
````diff
 <img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
 </div>
 
+## ⚙️ Evaluation
+
+This section provides instructions for reproducing the evaluation results reported in our paper.
+
+### 1. Getting Started
+
+Clone the repository and navigate to the project directory:
+
+```bash
+git clone https://github.com/InfiXAI/InfiGUI-G1.git
+cd InfiGUI-G1
+```
+
+### 2. Environment Setup
+
+The evaluation pipeline is built upon the [vLLM](https://github.com/vllm-project/vllm) library for efficient inference. For detailed installation guidance, please refer to the official vLLM repository. The specific versions used to obtain the results reported in our paper are as follows:
+
+- **Python**: `3.10.12`
+- **PyTorch**: `2.6.0`
+- **Transformers**: `4.50.1`
+- **vLLM**: `0.8.2`
+- **CUDA**: `12.6`
+
+The reported results were obtained on a server equipped with 4 x NVIDIA H800 GPUs.
````
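As a quick sanity check against the pinned versions listed above (a small helper sketch, not part of the official repo), you can print the versions that are actually installed:

```python
# Print installed versions to compare with the pinned ones listed above.
import sys
import torch
import transformers
import vllm

print("Python      :", sys.version.split()[0])
print("PyTorch     :", torch.__version__)
print("Transformers:", transformers.__version__)
print("vLLM        :", vllm.__version__)
print("CUDA (torch):", torch.version.cuda)
```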
````diff
+### 3. Model Download
+
+Download the InfiGUI-G1 models from the Hugging Face Hub into the `./models` directory.
+
+```bash
+# Create a directory for models
+mkdir -p ./models
+
+# Download InfiGUI-G1-3B
+huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-3B --local-dir ./models/InfiGUI-G1-3B
+
+# Download InfiGUI-G1-7B
+huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-7B --local-dir ./models/InfiGUI-G1-7B
+```
````
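If you prefer doing this from Python rather than the CLI, an equivalent sketch using `huggingface_hub` follows; the loop and directory naming simply mirror the commands above.

```python
# Python alternative to the huggingface-cli commands above.
from huggingface_hub import snapshot_download

for repo_id in ("InfiX-ai/InfiGUI-G1-3B", "InfiX-ai/InfiGUI-G1-7B"):
    local_dir = f"./models/{repo_id.split('/')[-1]}"
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"Downloaded {repo_id} to {local_dir}")
```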
````diff
+### 4. Dataset Download and Preparation
+
+Download the required evaluation benchmarks into the `./data` directory.
+
+```bash
+# Create a directory for datasets
+mkdir -p ./data
+
+# Download benchmarks
+huggingface-cli download --repo-type dataset --resume-download likaixin/ScreenSpot-Pro --local-dir ./data/ScreenSpot-Pro
+huggingface-cli download --repo-type dataset --resume-download ServiceNow/ui-vision --local-dir ./data/ui-vision
+huggingface-cli download --repo-type dataset --resume-download OS-Copilot/ScreenSpot-v2 --local-dir ./data/ScreenSpot-v2
+huggingface-cli download --repo-type dataset --resume-download OpenGVLab/MMBench-GUI --local-dir ./data/MMBench-GUI
+huggingface-cli download --repo-type dataset --resume-download vaundys/I2E-Bench --local-dir ./data/I2E-Bench
+```
+
+After downloading, some datasets require unzipping compressed image files.
+
+```bash
+# Unzip images for ScreenSpot-v2
+unzip ./data/ScreenSpot-v2/screenspotv2_image.zip -d ./data/ScreenSpot-v2/
+
+# Unzip images for MMBench-GUI
+unzip ./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip -d ./data/MMBench-GUI/
+```
````
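The same preparation can also be scripted; here is a sketch that mirrors the download and unzip commands above, assuming the repo IDs and archive names shown there.

```python
# Python sketch mirroring the dataset download and unzip steps above.
import zipfile
from pathlib import Path
from huggingface_hub import snapshot_download

DATASETS = [
    "likaixin/ScreenSpot-Pro",
    "ServiceNow/ui-vision",
    "OS-Copilot/ScreenSpot-v2",
    "OpenGVLab/MMBench-GUI",
    "vaundys/I2E-Bench",
]

for repo_id in DATASETS:
    local_dir = Path("./data") / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(local_dir))

# Extract the image archives named in the bash commands above.
for archive in ("./data/ScreenSpot-v2/screenspotv2_image.zip",
                "./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(Path(archive).parent)
```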
````diff
+### 5. Running the Evaluation
+
+To run the evaluation, use the `eval/eval.py` script. You must specify the path to the model, the benchmark name, and the tensor parallel size.
+
+Here is an example command to evaluate the `InfiGUI-G1-3B` model on the `screenspot-pro` benchmark using 4 GPUs:
+
+```bash
+python eval/eval.py \
+    ./models/InfiGUI-G1-3B \
+    --benchmark screenspot-pro \
+    --tensor-parallel 4
+```
+
+- **`model_path`**: The first positional argument specifies the path to the downloaded model directory (e.g., `./models/InfiGUI-G1-3B`).
+- **`--benchmark`**: Specifies the benchmark to evaluate. Available options include `screenspot-pro`, `screenspot-v2`, `ui-vision`, `mmbench-gui`, and `i2e-bench`.
+- **`--tensor-parallel`**: Sets the tensor parallelism size, which should typically match the number of available GPUs.
+
+Evaluation results, including detailed logs and performance metrics, will be saved to the `./output/{model_name}/{benchmark}/` directory.
````
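To sweep the whole benchmark suite in one run, a small driver sketch that simply shells out to the documented command once per benchmark; the model path and GPU count are example values.

```python
# Run eval/eval.py once per benchmark by invoking the documented CLI.
import subprocess

MODEL_PATH = "./models/InfiGUI-G1-7B"   # example value
BENCHMARKS = ["screenspot-pro", "screenspot-v2", "ui-vision", "mmbench-gui", "i2e-bench"]

for benchmark in BENCHMARKS:
    cmd = [
        "python", "eval/eval.py", MODEL_PATH,
        "--benchmark", benchmark,
        "--tensor-parallel", "4",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # results land in ./output/{model_name}/{benchmark}/
```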
````diff
+## 📚 Citation Information
 
 If you find this work useful, we would be grateful if you consider citing the following papers:
 
````
````diff
 year={2025}
 }
 ```
+
+## 🙏 Acknowledgements
+
+We would like to express our gratitude for the following open-source projects: [VERL](https://github.com/volcengine/verl), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) and [vLLM](https://github.com/vllm-project/vllm).
````