Improve model card: Add prominent links and evaluation instructions (#1)
- Improve model card: Add prominent links and evaluation instructions (25e00aaabe2d6bc56b439581892da2f4aeb43810)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
````diff
@@ -3,18 +3,20 @@ base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
 license: apache-2.0
 tags:
 - gui
 - agent
 - gui-grounding
 - reinforcement-learning
-pipeline_tag: image-text-to-text
-library_name: transformers
 ---
 
 # InfiGUI-G1-7B
 
 This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.
 
 The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a novel reinforcement learning method designed to enhance the model's **semantic alignment** for GUI grounding tasks. It overcomes the exploration bottlenecks of standard RLVR methods by integrating a multi-answer generation strategy with a theoretically-grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
@@ -91,7 +93,7 @@ def visualize_points(original_image: Image.Image, points: list,
     # Draw circle
     circle_radius = 20
     draw.ellipse([original_x - circle_radius, original_y - circle_radius,
-                  original_x + circle_radius, original_y + circle_radius]
                  fill=(255, 0, 0))
 
     # Draw label
@@ -125,7 +127,8 @@ def main():
 
     # Prepare model inputs
     instruction = "shuffle play the current playlist"
-    system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer
     prompt = f'''The screen's resolution is {new_width}x{new_height}.
 Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''
 
@@ -162,10 +165,6 @@ if __name__ == "__main__":
     main()
 ```
 
-To reproduce the results in our paper, please refer to our repo for detailed instructions.
-
-For more details on the methodology and evaluation, please refer to our [paper](https://arxiv.org/abs/2508.05731) and [repository](https://github.com/InfiXAI/InfiGUI-G1).
-
 ## Results
 
 Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.
@@ -210,7 +209,92 @@ On the widely-used ScreenSpot-V2 benchmark, which provides comprehensive coverag
 <img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
 </div>
 
-##
 
 If you find this work useful, we would be grateful if you consider citing the following papers:
 
@@ -243,3 +327,7 @@ If you find this work useful, we would be grateful if you consider citing the fo
 year={2025}
 }
 ```
````
````diff
 - Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
+library_name: transformers
 license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
 - gui
 - agent
 - gui-grounding
 - reinforcement-learning
 ---
 
 # InfiGUI-G1-7B
 
+**[📚 Paper](https://arxiv.org/abs/2508.05731)** | **[🌐 Project Page](https://osatlas.github.io/)** | **[💻 Code](https://github.com/InfiXAI/InfiGUI-G1)**
+
 This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.
 
 The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a novel reinforcement learning method designed to enhance the model's **semantic alignment** for GUI grounding tasks. It overcomes the exploration bottlenecks of standard RLVR methods by integrating a multi-answer generation strategy with a theoretically-grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
````
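Purely as an illustration of the multi-answer idea described above, and not the actual AEPO objective from the paper, here is a minimal sketch of how a reward over several sampled candidate points might be scored; the candidate format, the ground-truth box, and the 1/k scaling are all assumptions of this example.

```python
# Illustrative only: NOT the AEPO reward function from the paper.
# Assumes each rollout proposes several candidate points for one instruction,
# and rewards an early correct hit while discouraging needless extra answers.
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def inside(point: Point, box: Box) -> bool:
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def multi_answer_reward(candidates: List[Point], gt_box: Box) -> float:
    """Toy adaptive reward: 1/k if the k-th candidate (in proposal order) is the
    first one to land inside the ground-truth box, 0 if none of them do."""
    for k, point in enumerate(candidates, start=1):
        if inside(point, gt_box):
            return 1.0 / k
    return 0.0

# Example: the second of three candidates hits the target element.
print(multi_answer_reward([(10, 10), (204, 118), (400, 60)], (190, 100, 260, 140)))  # 0.5
```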
````diff
     # Draw circle
     circle_radius = 20
     draw.ellipse([original_x - circle_radius, original_y - circle_radius,
+                  original_x + circle_radius, original_y + circle_radius],
                  fill=(255, 0, 0))
 
     # Draw label
````
````diff
 
     # Prepare model inputs
     instruction = "shuffle play the current playlist"
+    system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer. \
+The reasoning process MUST BE enclosed within <think> </think> tags.'
     prompt = f'''The screen's resolution is {new_width}x{new_height}.
 Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''
 
````
````diff
     main()
 ```
 
````
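The prompt above asks the model to reply, after its `<think>` block, with a JSON list of `point_2d` coordinates. As a small companion sketch (not part of the card's own example script), this is one way such a response string could be parsed back into points; the demo response text and the helper name are made up for illustration.

```python
# Hypothetical helper for parsing a response formatted as requested by the prompt above.
import json
import re

def extract_points(response: str) -> list[tuple[float, float]]:
    """Strip an optional <think>...</think> block, then parse the JSON answer
    of the form [{"point_2d": [x, y]}, ...] into (x, y) tuples."""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    match = re.search(r"\[.*\]", answer, flags=re.DOTALL)  # tolerate extra text around the JSON
    if match is None:
        return []
    return [tuple(item["point_2d"]) for item in json.loads(match.group(0))]

# Example with a made-up response string:
demo = '<think>The shuffle button is near the player bar.</think>[{"point_2d": [512, 1830]}]'
print(extract_points(demo))  # [(512, 1830)]
```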
````diff
 ## Results
 
 Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.
````
````diff
 <img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
 </div>
 
+## ⚙️ Evaluation
+
+This section provides instructions for reproducing the evaluation results reported in our paper.
+
+### 1. Getting Started
+
+Clone the repository and navigate to the project directory:
+
+```bash
+git clone https://github.com/InfiXAI/InfiGUI-G1.git
+cd InfiGUI-G1
+```
+
+### 2. Environment Setup
+
+The evaluation pipeline is built upon the [vLLM](https://github.com/vllm-project/vllm) library for efficient inference. For detailed installation guidance, please refer to the official vLLM repository. The specific versions used to obtain the results reported in our paper are as follows:
+
+- **Python**: `3.10.12`
+- **PyTorch**: `2.6.0`
+- **Transformers**: `4.50.1`
+- **vLLM**: `0.8.2`
+- **CUDA**: `12.6`
+
+The reported results were obtained on a server equipped with 4 x NVIDIA H800 GPUs.
````
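As a quick sanity check against the pinned versions listed above (a small helper sketch, not part of the official repo), you can print the versions that are actually installed:

```python
# Print installed versions to compare with the pinned ones listed above.
import sys
import torch
import transformers
import vllm

print("Python      :", sys.version.split()[0])
print("PyTorch     :", torch.__version__)
print("Transformers:", transformers.__version__)
print("vLLM        :", vllm.__version__)
print("CUDA (torch):", torch.version.cuda)
```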
````diff
+### 3. Model Download
+
+Download the InfiGUI-G1 models from the Hugging Face Hub into the `./models` directory.
+
+```bash
+# Create a directory for models
+mkdir -p ./models
+
+# Download InfiGUI-G1-3B
+huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-3B --local-dir ./models/InfiGUI-G1-3B
+
+# Download InfiGUI-G1-7B
+huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-7B --local-dir ./models/InfiGUI-G1-7B
+```
````
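If you prefer doing this from Python rather than the CLI, an equivalent sketch using `huggingface_hub` follows; the loop and directory naming simply mirror the commands above.

```python
# Python alternative to the huggingface-cli commands above.
from huggingface_hub import snapshot_download

for repo_id in ("InfiX-ai/InfiGUI-G1-3B", "InfiX-ai/InfiGUI-G1-7B"):
    local_dir = f"./models/{repo_id.split('/')[-1]}"
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"Downloaded {repo_id} to {local_dir}")
```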
````diff
+### 4. Dataset Download and Preparation
+
+Download the required evaluation benchmarks into the `./data` directory.
+
+```bash
+# Create a directory for datasets
+mkdir -p ./data
+
+# Download benchmarks
+huggingface-cli download --repo-type dataset --resume-download likaixin/ScreenSpot-Pro --local-dir ./data/ScreenSpot-Pro
+huggingface-cli download --repo-type dataset --resume-download ServiceNow/ui-vision --local-dir ./data/ui-vision
+huggingface-cli download --repo-type dataset --resume-download OS-Copilot/ScreenSpot-v2 --local-dir ./data/ScreenSpot-v2
+huggingface-cli download --repo-type dataset --resume-download OpenGVLab/MMBench-GUI --local-dir ./data/MMBench-GUI
+huggingface-cli download --repo-type dataset --resume-download vaundys/I2E-Bench --local-dir ./data/I2E-Bench
+```
+
+After downloading, some datasets require unzipping compressed image files.
+
+```bash
+# Unzip images for ScreenSpot-v2
+unzip ./data/ScreenSpot-v2/screenspotv2_image.zip -d ./data/ScreenSpot-v2/
+
+# Unzip images for MMBench-GUI
+unzip ./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip -d ./data/MMBench-GUI/
+```
````
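The same preparation can also be scripted; here is a sketch that mirrors the download and unzip commands above, assuming the repo IDs and archive names shown there.

```python
# Python sketch mirroring the dataset download and unzip steps above.
import zipfile
from pathlib import Path
from huggingface_hub import snapshot_download

DATASETS = [
    "likaixin/ScreenSpot-Pro",
    "ServiceNow/ui-vision",
    "OS-Copilot/ScreenSpot-v2",
    "OpenGVLab/MMBench-GUI",
    "vaundys/I2E-Bench",
]

for repo_id in DATASETS:
    local_dir = Path("./data") / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(local_dir))

# Extract the image archives named in the bash commands above.
for archive in ("./data/ScreenSpot-v2/screenspotv2_image.zip",
                "./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(Path(archive).parent)
```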
````diff
+### 5. Running the Evaluation
+
+To run the evaluation, use the `eval/eval.py` script. You must specify the path to the model, the benchmark name, and the tensor parallel size.
+
+Here is an example command to evaluate the `InfiGUI-G1-3B` model on the `screenspot-pro` benchmark using 4 GPUs:
+
+```bash
+python eval/eval.py \
+    ./models/InfiGUI-G1-3B \
+    --benchmark screenspot-pro \
+    --tensor-parallel 4
+```
+
+- **`model_path`**: The first positional argument specifies the path to the downloaded model directory (e.g., `./models/InfiGUI-G1-3B`).
+- **`--benchmark`**: Specifies the benchmark to evaluate. Available options include `screenspot-pro`, `screenspot-v2`, `ui-vision`, `mmbench-gui`, and `i2e-bench`.
+- **`--tensor-parallel`**: Sets the tensor parallelism size, which should typically match the number of available GPUs.
+
+Evaluation results, including detailed logs and performance metrics, will be saved to the `./output/{model_name}/{benchmark}/` directory.
````
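To sweep the whole benchmark suite in one run, a small driver sketch that simply shells out to the documented command once per benchmark; the model path and GPU count are example values.

```python
# Run eval/eval.py once per benchmark by invoking the documented CLI.
import subprocess

MODEL_PATH = "./models/InfiGUI-G1-7B"   # example value
BENCHMARKS = ["screenspot-pro", "screenspot-v2", "ui-vision", "mmbench-gui", "i2e-bench"]

for benchmark in BENCHMARKS:
    cmd = [
        "python", "eval/eval.py", MODEL_PATH,
        "--benchmark", benchmark,
        "--tensor-parallel", "4",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # results land in ./output/{model_name}/{benchmark}/
```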
````diff
+## 📚 Citation Information
 
 If you find this work useful, we would be grateful if you consider citing the following papers:
 
````
````diff
 year={2025}
 }
 ```
+
+## 🙏 Acknowledgements
+
+We would like to express our gratitude for the following open-source projects: [VERL](https://github.com/volcengine/verl), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) and [vLLM](https://github.com/vllm-project/vllm).
````