SiriusL and nielsr (HF Staff) committed
Commit 8ac16f8 · verified · 1 parent: 5e95c7f

Improve model card: Add prominent links and evaluation instructions (#1)


- Improve model card: Add prominent links and evaluation instructions (25e00aaabe2d6bc56b439581892da2f4aeb43810)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +97 -9
README.md CHANGED
@@ -3,18 +3,20 @@ base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
  language:
  - en
  license: apache-2.0
  tags:
  - gui
  - agent
  - gui-grounding
  - reinforcement-learning
- pipeline_tag: image-text-to-text
- library_name: transformers
  ---

  # InfiGUI-G1-7B

  This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.

  The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a novel reinforcement learning method designed to enhance the model's **semantic alignment** for GUI grounding tasks. It overcomes the exploration bottlenecks of standard RLVR methods by integrating a multi-answer generation strategy with a theoretically-grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
@@ -91,7 +93,7 @@ def visualize_points(original_image: Image.Image, points: list,
  # Draw circle
  circle_radius = 20
  draw.ellipse([original_x - circle_radius, original_y - circle_radius,
- original_x + circle_radius, original_y + circle_radius],
  fill=(255, 0, 0))

  # Draw label
@@ -125,7 +127,8 @@ def main():

  # Prepare model inputs
  instruction = "shuffle play the current playlist"
- system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer.\nThe reasoning process MUST BE enclosed within <think> </think> tags.'
  prompt = f'''The screen's resolution is {new_width}x{new_height}.
  Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''

@@ -162,10 +165,6 @@ if __name__ == "__main__":
  main()
  ```

- To reproduce the results in our paper, please refer to our repo for detailed instructions.
-
- For more details on the methodology and evaluation, please refer to our [paper](https://arxiv.org/abs/2508.05731) and [repository](https://github.com/InfiXAI/InfiGUI-G1).
-
  ## Results

  Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.
@@ -210,7 +209,92 @@ On the widely-used ScreenSpot-V2 benchmark, which provides comprehensive coverag
  <img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
  </div>

- ## Citation Information

  If you find this work useful, we would be grateful if you consider citing the following papers:

@@ -243,3 +327,7 @@ If you find this work useful, we would be grateful if you consider citing the fo
  year={2025}
  }
  ```

  - Qwen/Qwen2.5-VL-7B-Instruct
  language:
  - en
+ library_name: transformers
  license: apache-2.0
+ pipeline_tag: image-text-to-text
  tags:
  - gui
  - agent
  - gui-grounding
  - reinforcement-learning
  ---

  # InfiGUI-G1-7B

+ **[📚 Paper](https://arxiv.org/abs/2508.05731)** | **[🌐 Project Page](https://osatlas.github.io/)** | **[💻 Code](https://github.com/InfiXAI/InfiGUI-G1)**
+
  This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.

  The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a novel reinforcement learning method designed to enhance the model's **semantic alignment** for GUI grounding tasks. It overcomes the exploration bottlenecks of standard RLVR methods by integrating a multi-answer generation strategy with a theoretically-grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
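For intuition only, the sketch below illustrates the general idea of rewarding a model that proposes several candidate answers for one grounding instruction, where credit depends on whether any candidate hits the target element. This is a generic toy illustration added by the editor, not the AEPO objective from the paper; the function names, the hit test, and the rank-discounted weighting are all assumptions made for this example.

```python
# Toy illustration of scoring multiple candidate clicks against a target box.
# NOT the AEPO reward from the paper; purely a generic sketch for intuition.
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def hit(point: Point, box: Box) -> bool:
    """Return True if the predicted point falls inside the target element's box."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def multi_answer_reward(candidates: List[Point], target: Box) -> float:
    """Full credit if the first candidate hits, discounted credit if a later one does."""
    for rank, point in enumerate(candidates):
        if hit(point, target):
            return 1.0 / (rank + 1)  # assumed rank-discounted credit
    return 0.0

# Example: two candidate points for one instruction, target box in screen pixels.
print(multi_answer_reward([(420, 980), (64, 32)], (400, 950, 480, 1010)))  # -> 1.0
```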
 
  # Draw circle
  circle_radius = 20
  draw.ellipse([original_x - circle_radius, original_y - circle_radius,
+ original_x + circle_radius, original_y + circle_radius],\
  fill=(255, 0, 0))

  # Draw label
 

  # Prepare model inputs
  instruction = "shuffle play the current playlist"
+ system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer.\
+ The reasoning process MUST BE enclosed within <think> </think> tags.'
  prompt = f'''The screen's resolution is {new_width}x{new_height}.
  Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''

  main()
  ```
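Since the prompt above asks for coordinates as a JSON list of `point_2d` entries and the system prompt wraps the reasoning in `<think> </think>` tags, parsing the generated text might look roughly like the following. This is an editor's hypothetical sketch, not part of the model card: the `output_text` variable and the rescaling step back to the original resolution are assumptions for illustration.

```python
import json
import re

def parse_points(output_text: str, new_size, original_size):
    """Extract point_2d coordinates from the model output and map them back
    to the original screenshot resolution (assumes a simple linear resize)."""
    # Drop the <think>...</think> reasoning block, if present.
    answer = re.sub(r"<think>.*?</think>", "", output_text, flags=re.DOTALL).strip()
    # The prompt requests JSON like: [{"point_2d": [x, y]}, ...]
    match = re.search(r"\[.*\]", answer, flags=re.DOTALL)
    points = [item["point_2d"] for item in json.loads(match.group(0))]
    sx = original_size[0] / new_size[0]
    sy = original_size[1] / new_size[1]
    return [(x * sx, y * sy) for x, y in points]

# Example with a made-up response string and resolutions:
demo = '<think>locate the shuffle button</think>[{"point_2d": [512, 1344]}]'
print(parse_points(demo, new_size=(1036, 2240), original_size=(1080, 2340)))
```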

  ## Results

  Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.
 
  <img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
  </div>

+ ## ⚙️ Evaluation
+
+ This section provides instructions for reproducing the evaluation results reported in our paper.
+
+ ### 1. Getting Started
+
+ Clone the repository and navigate to the project directory:
+
+ ```bash
+ git clone https://github.com/InfiXAI/InfiGUI-G1.git
+ cd InfiGUI-G1
+ ```
+
+ ### 2. Environment Setup
+
+ The evaluation pipeline is built upon the [vLLM](https://github.com/vllm-project/vllm) library for efficient inference. For detailed installation guidance, please refer to the official vLLM repository. The specific versions used to obtain the results reported in our paper are as follows:
+
+ - **Python**: `3.10.12`
+ - **PyTorch**: `2.6.0`
+ - **Transformers**: `4.50.1`
+ - **vLLM**: `0.8.2`
+ - **CUDA**: `12.6`
+
+ The reported results were obtained on a server equipped with 4 x NVIDIA H800 GPUs.
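For convenience, an environment matching the pinned versions above could be set up roughly as follows. This is a sketch under the assumption that plain `pip` wheels are acceptable; the authoritative installation steps (including the CUDA 12.6 build) are in the InfiGUI-G1 and vLLM repositories.

```bash
# Rough sketch only: pins the versions listed above; see the repos for exact steps.
python3.10 -m venv .venv && source .venv/bin/activate
pip install torch==2.6.0 transformers==4.50.1 vllm==0.8.2
```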
+
+ ### 3. Model Download
+
+ Download the InfiGUI-G1 models from the Hugging Face Hub into the `./models` directory.
+
+ ```bash
+ # Create a directory for models
+ mkdir -p ./models
+
+ # Download InfiGUI-G1-3B
+ huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-3B --local-dir ./models/InfiGUI-G1-3B
+
+ # Download InfiGUI-G1-7B
+ huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-7B --local-dir ./models/InfiGUI-G1-7B
+ ```
+
+ ### 4. Dataset Download and Preparation
+
+ Download the required evaluation benchmarks into the `./data` directory.
+
+ ```bash
+ # Create a directory for datasets
+ mkdir -p ./data
+
+ # Download benchmarks
+ huggingface-cli download --repo-type dataset --resume-download likaixin/ScreenSpot-Pro --local-dir ./data/ScreenSpot-Pro
+ huggingface-cli download --repo-type dataset --resume-download ServiceNow/ui-vision --local-dir ./data/ui-vision
+ huggingface-cli download --repo-type dataset --resume-download OS-Copilot/ScreenSpot-v2 --local-dir ./data/ScreenSpot-v2
+ huggingface-cli download --repo-type dataset --resume-download OpenGVLab/MMBench-GUI --local-dir ./data/MMBench-GUI
+ huggingface-cli download --repo-type dataset --resume-download vaundys/I2E-Bench --local-dir ./data/I2E-Bench
+ ```
+
+ After downloading, some datasets require unzipping compressed image files.
+
+ ```bash
+ # Unzip images for ScreenSpot-v2
+ unzip ./data/ScreenSpot-v2/screenspotv2_image.zip -d ./data/ScreenSpot-v2/
+
+ # Unzip images for MMBench-GUI
+ unzip ./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip -d ./data/MMBench-GUI/
+ ```
+
+ ### 5. Running the Evaluation
+
+ To run the evaluation, use the `eval/eval.py` script. You must specify the path to the model, the benchmark name, and the tensor parallel size.
+
+ Here is an example command to evaluate the `InfiGUI-G1-3B` model on the `screenspot-pro` benchmark using 4 GPUs:
+
+ ```bash
+ python eval/eval.py \
+     ./models/InfiGUI-G1-3B \
+     --benchmark screenspot-pro \
+     --tensor-parallel 4
+ ```
+
+ - **`model_path`**: The first positional argument specifies the path to the downloaded model directory (e.g., `./models/InfiGUI-G1-3B`).
+ - **`--benchmark`**: Specifies the benchmark to evaluate. Available options include `screenspot-pro`, `screenspot-v2`, `ui-vision`, `mmbench-gui`, and `i2e-bench`.
+ - **`--tensor-parallel`**: Sets the tensor parallelism size, which should typically match the number of available GPUs.
+
+ Evaluation results, including detailed logs and performance metrics, will be saved to the `./output/{model_name}/{benchmark}/` directory.
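To run the same model across every benchmark listed above, a small shell loop is enough; this is a usage sketch added by the editor that assumes the `eval/eval.py` interface shown in the example command.

```bash
# Sketch: evaluate InfiGUI-G1-7B on all benchmarks listed above, 4-way tensor parallel.
for bench in screenspot-pro screenspot-v2 ui-vision mmbench-gui i2e-bench; do
  python eval/eval.py ./models/InfiGUI-G1-7B --benchmark "$bench" --tensor-parallel 4
done
```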
+
+ ## 📚 Citation Information

  If you find this work useful, we would be grateful if you consider citing the following papers:

 
  year={2025}
  }
  ```
+
+ ## 🙏 Acknowledgements
+
+ We would like to express our gratitude for the following open-source projects: [VERL](https://github.com/volcengine/verl), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) and [vLLM](https://github.com/vllm-project/vllm).