---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: apache-2.0
tags:
- gui
- agent
- gui-grounding
- reinforcement-learning
pipeline_tag: image-text-to-text
library_name: transformers
---

# InfiGUI-G1-7B

This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://github.com/InfiXAI/InfiGUI-G1)**.

The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a novel reinforcement learning method designed to enhance the model's **semantic alignment** for GUI grounding tasks. It overcomes the exploration bottlenecks of standard RLVR methods by integrating a multi-answer generation strategy with a theoretically grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
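
To give a rough intuition for the multi-answer idea, here is a minimal sketch of a rank-discounted grounding reward. It is illustrative only and is **not** the paper's actual AEPO objective; the candidate format, hit test, and `1/rank` discount are our assumptions:

```python
# Illustrative sketch only -- NOT the AEPO reward from the paper.
# The policy proposes several candidate points per query; reward is
# earned if any candidate lands inside the target element's box,
# discounted by how many candidates were spent before the first hit.

def multi_answer_reward(candidates: list[tuple[float, float]],
                        target_bbox: tuple[float, float, float, float]) -> float:
    x0, y0, x1, y1 = target_bbox
    for rank, (x, y) in enumerate(candidates, start=1):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return 1.0 / rank  # earlier hits earn more reward
    return 0.0  # no candidate hit the target
```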

## Quick Start

### Installation

First, install the required dependencies (the example below also uses `torch`, `pillow`, and `requests`, and `device_map="auto"` relies on `accelerate`):

```bash
pip install torch transformers accelerate qwen-vl-utils pillow requests
```
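
The example also loads the model with `attn_implementation="flash_attention_2"`, which requires the optional `flash-attn` package; if it is not installed in your environment, remove that argument to fall back to the default attention implementation.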

### Example

```python
import json
import math
import torch
import requests
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info, smart_resize

MAX_IMAGE_PIXELS = 5600 * 28 * 28


def resize_image(width: int, height: int, max_pixels: int) -> tuple[int, int]:
    """
    Resize image to fit within the max_pixels constraint while maintaining
    aspect ratio, then apply smart_resize for final dimension optimization.
    """
    current_pixels = width * height

    if current_pixels <= max_pixels:
        target_width, target_height = width, height
    else:
        scale_factor = math.sqrt(max_pixels / current_pixels)
        target_width = round(width * scale_factor)
        target_height = round(height * scale_factor)

    # smart_resize snaps dimensions to multiples of the vision patch size
    final_height, final_width = smart_resize(target_height, target_width)

    return final_width, final_height


def load_image(img_path: str) -> Image.Image:
    """Load image from a URL or a local path."""
    if img_path.startswith(("http://", "https://")):
        response = requests.get(img_path)
        return Image.open(BytesIO(response.content))
    return Image.open(img_path)


def visualize_points(original_image: Image.Image, points: list,
                     new_width: int, new_height: int,
                     original_width: int, original_height: int) -> None:
    """Draw prediction points on the original image and save as output.png."""
    output_img = original_image.copy()
    draw = ImageDraw.Draw(output_img)
    font = ImageFont.load_default(size=100)

    for i, point_data in enumerate(points):
        coords = point_data['point_2d']

        # Map coordinates from the resized image back to the original image
        original_x = int(coords[0] / new_width * original_width)
        original_y = int(coords[1] / new_height * original_height)

        label = str(i + 1)

        # Draw circle
        circle_radius = 20
        draw.ellipse([original_x - circle_radius, original_y - circle_radius,
                      original_x + circle_radius, original_y + circle_radius],
                     fill=(255, 0, 0))

        # Draw label
        draw.text((original_x + 20, original_y - 20), label, fill=(255, 0, 0), font=font)

        print(f"Point {i+1}: Predicted coordinates {coords} -> Mapped coordinates [{original_x}, {original_y}]")

    output_img.save("output.png")
    print(f"Visualization with {len(points)} points saved to output.png")


def main():
    # Load model and processor
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "InfiX-ai/InfiGUI-G1-7B",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("InfiX-ai/InfiGUI-G1-7B", padding_side="left")

    # Load and process image
    img_path = "https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/test_image.png"
    image = load_image(img_path)

    # Keep the original image for visualization; resize a copy for model input
    original_image = image.copy()
    original_width, original_height = image.size
    new_width, new_height = resize_image(original_width, original_height, MAX_IMAGE_PIXELS)
    resized_image = image.resize((new_width, new_height))

    # Prepare model inputs
    instruction = "shuffle play the current playlist"
    system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer.\nThe reasoning process MUST BE enclosed within <think> </think> tags.'
    prompt = f'''The screen's resolution is {new_width}x{new_height}.
Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": resized_image},
                {"type": "text", "text": prompt}
            ]
        }
    ]

    # Generate predictions
    text = processor.apply_chat_template([messages], tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info([messages])
    inputs = processor(text=text, images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    output_text = processor.batch_decode(
        [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    # Strip the <think> block and any code fences, then parse the JSON answer
    output_text = output_text[0].split("</think>")[-1].replace("```json", "").replace("```", "").strip()
    output = json.loads(output_text)

    if output:
        visualize_points(original_image, output, new_width, new_height, original_width, original_height)


if __name__ == "__main__":
    main()
```
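
The script prints each predicted point alongside its coordinates mapped back to the original image, and saves the annotated screenshot as `output.png`. The model is expected to reason inside `<think> </think>` tags and then emit a JSON array of points in the resized image's coordinate system; a hypothetical completion (the coordinates below are made up for illustration) looks like:

```text
<think>The instruction asks to shuffle play the current playlist, so I need to locate the shuffle control ...</think>
[{"point_2d": [512, 1294]}]
```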

To reproduce the results in our paper, and for more details on the methodology and evaluation, please refer to the [InfiGUI-G1 repository](https://github.com/InfiXAI/InfiGUI-G1).

## Citation Information

If you find this work useful, please consider citing the following papers:

```bibtex
@article{liu2025infigui,
  title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
  author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2504.14239},
  year={2025}
}
```

```bibtex
@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}
```