+---
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
+language:
+- en
+license: apache-2.0
+tags:
+- gui
+- agent
+- gui-grounding
+- reinforcement-learning
+pipeline_tag: image-text-to-text
+library_name: transformers
+---
+# InfiGUI-G1-7B
+This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://github.com/InfiXAI/InfiGUI-R1)**.
+The model is based on `Qwen2.5-VL-7B-Instruct` and fine-tuned with our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a reinforcement learning method designed to improve the model's **semantic alignment** on GUI grounding tasks. It addresses the exploration bottleneck of standard RLVR (reinforcement learning with verifiable rewards) methods by combining a multi-answer generation strategy with a theoretically grounded adaptive reward function, which makes learning on complex GUI interactions more effective and efficient.
+## Quick Start
+### Installation
+First, install the required dependencies:
+```bash
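+pip install transformers qwen-vl-utils torch accelerate pillow requests
+pip install flash-attn --no-build-isolation  # needed for attn_implementation="flash_attention_2"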
+```
+### Example
+```python
+import json
+import math
+import torch
+import requests
+from io import BytesIO
+from PIL import Image, ImageDraw, ImageFont
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info, smart_resize
+MAX_IMAGE_PIXELS = 5600 * 28 * 28
+def resize_image(width: int, height: int, max_pixels: int) -> tuple[int, int]:
+    """
+    Resize image to fit within max_pixels constraint while maintaining aspect ratio.
+    Applies smart_resize for final dimension optimization.
+    """
+    current_pixels = width * height
+    if current_pixels <= max_pixels:
+        target_width, target_height = width, height
+    else:
+        scale_factor = math.sqrt(max_pixels / current_pixels)
+        target_width = round(width * scale_factor)
+        target_height = round(height * scale_factor)
+    # Apply smart_resize for final dimensions
+    final_height, final_width = smart_resize(target_height, target_width)
+    return final_width, final_height
+def load_image(img_path: str) -> Image.Image:
+    """Load image from URL or local path."""
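+    if img_path.startswith(("http://", "https://")):
+        response = requests.get(img_path, timeout=30)
+        response.raise_for_status()  # Fail early on HTTP errors instead of decoding an error page
+        return Image.open(BytesIO(response.content))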
+    else:
+        return Image.open(img_path)
+def visualize_points(original_image: Image.Image, points: list,
+                    new_width: int, new_height: int,
+                    original_width: int, original_height: int) -> None:
+    """Draw prediction points on original image and save as output.png."""
+    output_img = original_image.copy()
+    draw = ImageDraw.Draw(output_img)
+    font = ImageFont.load_default(size=100)  # The size argument requires Pillow >= 10.1
+    for i, point_data in enumerate(points):
+        coords = point_data['point_2d']
+        # Map coordinates from resized image back to original image
+        original_x = int(coords[0] / new_width * original_width)
+        original_y = int(coords[1] / new_height * original_height)
+        label = str(i + 1)
+        # Draw circle
+        circle_radius = 20
+        draw.ellipse([original_x - circle_radius, original_y - circle_radius,
+                     original_x + circle_radius, original_y + circle_radius],
+                    fill=(255, 0, 0))
+        # Draw label
+        draw.text((original_x + 20, original_y - 20), label, fill=(255, 0, 0), font=font)
+        print(f"Point {i+1}: Predicted coordinates {coords} -> Mapped coordinates [{original_x}, {original_y}]")
+    output_img.save("output.png")
+    print(f"Visualization with {len(points)} points saved to output.png")
+def main():
+    # Load model and processor
+    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+        "InfiX-ai/InfiGUI-G1-7B",
+        torch_dtype=torch.bfloat16,
+        attn_implementation="flash_attention_2",
+        device_map="auto"
+    )
+    processor = AutoProcessor.from_pretrained("InfiX-ai/InfiGUI-G1-7B", padding_side="left")
+    # Load and process image
+    img_path = "https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/test_image.png"
+    image = load_image(img_path)
+    # Store original image and resize for model input
+    original_image = image.copy()
+    original_width, original_height = image.size
+    new_width, new_height = resize_image(original_width, original_height, MAX_IMAGE_PIXELS)
+    resized_image = image.resize((new_width, new_height))
+    # Prepare model inputs
+    instruction = "shuffle play the current playlist"
+    system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer.\nThe reasoning process MUST BE enclosed within <think> </think> tags.'
+    prompt = f'''The screen's resolution is {new_width}x{new_height}.
+Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "image": resized_image},
+                {"type": "text", "text": prompt}
+            ]
+        }
+    ]
+    # Generate predictions
+    text = processor.apply_chat_template([messages], tokenize=False, add_generation_prompt=True)
+    image_inputs, video_inputs = process_vision_info([messages])
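+    # Move inputs to the model's device rather than hard-coding "cuda" (the model is loaded with device_map="auto")
+    inputs = processor(text=text, images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)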
+    generated_ids = model.generate(**inputs, max_new_tokens=512)
+    output_text = processor.batch_decode(
+        [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
+        skip_special_tokens=True,
+        clean_up_tokenization_spaces=False
+    )
+    # Parse and visualize results
+    output_text = output_text[0].split("</think>")[-1].replace("```json", "").replace("```", "").strip()
+    output = json.loads(output_text)
+    if output:
+        visualize_points(original_image, output, new_width, new_height, original_width, original_height)
+if __name__ == "__main__":
+    main()
+```
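The parsing step in the example above assumes the model always returns well-formed JSON after the `</think>` tag. The following is a minimal sketch of a more defensive version, assuming the output format requested in the prompt (`[{"point_2d": [x, y]}, ...]`). The helper names `parse_points` and `map_to_original` are hypothetical and are not part of the released code; the coordinate mapping mirrors the arithmetic used in `visualize_points`.

```python
import json

# Markdown code fence marker, built indirectly so this snippet is easy to embed.
FENCE = "`" * 3


def parse_points(raw_output: str) -> list:
    """Return the valid {"point_2d": [x, y]} entries found in a raw model response.

    Hypothetical helper: returns an empty list instead of raising when the
    response is not a JSON list of points.
    """
    # Keep only the text after the reasoning block, as in the example above.
    answer = raw_output.split("</think>")[-1]
    # Drop optional Markdown code fences around the JSON payload.
    answer = answer.replace(FENCE + "json", "").replace(FENCE, "").strip()
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []

    points = []
    for item in parsed:
        coords = item.get("point_2d") if isinstance(item, dict) else None
        # Accept only entries with exactly two numeric coordinates.
        if (isinstance(coords, list) and len(coords) == 2
                and all(isinstance(c, (int, float)) for c in coords)):
            points.append({"point_2d": coords})
    return points


def map_to_original(coords, resized_size, original_size):
    """Map (x, y) from the resized image back to the original image."""
    new_width, new_height = resized_size
    original_width, original_height = original_size
    x = int(coords[0] / new_width * original_width)
    y = int(coords[1] / new_height * original_height)
    return x, y


if __name__ == "__main__":
    # Illustrative response text, not real model output.
    sample = ("<think>The shuffle button is near the top.</think>\n"
              + FENCE + 'json\n[{"point_2d": [128, 256]}]\n' + FENCE)
    for point in parse_points(sample):
        print(point, "->", map_to_original(point["point_2d"], (1024, 512), (2048, 1024)))
```

In `main()`, the result of `parse_points(output_text[0])` could be passed to `visualize_points` in place of the `json.loads` output, so a malformed response yields no points rather than an exception.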
+To reproduce the results in our paper, please refer to our [repository](https://github.com/InfiXAI/InfiGUI-G1) for detailed instructions.
+For more details on the methodology and evaluation, please refer to our [paper](https://github.com/InfiXAI/InfiGUI-R1) and [repository](https://github.com/InfiXAI/InfiGUI-G1).
+## Citation Information
+If you find this work useful, please consider citing the following papers:
+```bibtex
+@article{liu2025infigui,
+  title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
+  author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
+  journal={arXiv preprint arXiv:2504.14239},
+  year={2025}
+}
+```
+```bibtex
+@article{liu2025infiguiagent,
+  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
+  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
+  journal={arXiv preprint arXiv:2501.04575},
+  year={2025}
+}
+```