UI-Venus

This repository contains the UI-Venus model from the report UI-Venus Technical Report: Building High-performance UI Agents with RFT. UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks, including ScreenSpot-v2/Pro and AndroidWorld.



📈 UI-Venus Benchmark Performance

UI-Venus Performance Across Datasets

Figure: Performance of UI-Venus across multiple benchmark datasets. UI-Venus achieves State-of-the-Art (SOTA) results on key UI understanding and interaction benchmarks, including ScreenSpot-Pro, ScreenSpot-v2, OSWorld-G, UI-Vision, and AndroidWorld. The results demonstrate its capability in visual grounding, UI navigation, cross-platform generalization, and complex task reasoning.

Model Description

UI-Venus is a multimodal UI agent built on Qwen2.5-VL that performs accurate UI grounding and navigation using only screenshots as input. The 7B variant achieves 94.1% on ScreenSpot-v2 and 50.8% on ScreenSpot-Pro, and the 72B variant achieves 95.3% and 61.9%, surpassing prior SOTA models such as GTA1 and UI-TARS-1.5. On the AndroidWorld navigation benchmark, the 7B and 72B variants achieve success rates of 49.1% and 65.9%, respectively, demonstrating strong planning and generalization capabilities.

Key innovations include:

  • SOTA Open-Source UI Agent: Publicly released to advance research in autonomous UI interaction and agent-based systems.
  • Reinforcement Fine-Tuning (RFT): Utilizes carefully designed reward functions for both grounding and navigation tasks.
  • Efficient Data Cleaning: Trained on several hundred thousand high-quality samples to ensure robustness.
  • Self-Evolving Trajectory History Alignment & Sparse Action Enhancement: Improves reasoning coherence and action distribution for better long-horizon planning.

Installation

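First, install the required dependencies (the Quick Start code also expects PyTorch, accelerate for device_map="auto", and flash-attn for attn_implementation="flash_attention_2" to be available):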

pip install transformers==4.49.0 qwen-vl-utils

Quick Start

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from typing import Dict, Tuple, Any
import torch
import os
import re
from qwen_vl_utils import process_vision_info

# -----------------------------
# Model & Tokenizer
# -----------------------------
MODEL_NAME = "inclusionAI/UI-Venus-Navi-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

GENERATION_CONFIG = {
    "max_new_tokens": 2048,
    "do_sample": False,  # greedy decoding; temperature is not used in this mode
}

# -----------------------------
# Prompt Template
# -----------------------------
PROMPT_TEMPLATE = """**You are a GUI Agent.**
Your task is to analyze a given user task, review current screenshot and previous actions, and determine the next action to complete the task.

### User Task
{user_task}

### Previous Actions
{previous_actions}

### Available Actions
Click(box=(x1, y1))
Drag(start=(x1, y1), end=(x2, y2))
Scroll(start=(x1, y1), end=(x2, y2), direction='down/up/right/left')
Type(content='')
Launch(app='')
Wait()
Finished(content='')
CallUser(content='')
LongPress(box=(x1, y1))
PressBack()
PressHome()
PressEnter()
PressRecent()

### Instruction
- Make sure you understand the task goal to avoid wrong actions.
- Examine the screenshot carefully. History may be unreliable.
- For user questions, reply with `CallUser`, then `Finished` if done.
- Explore screen content using scroll in different directions.
- Copy text: select → click `copy`.
- Paste text: long press text box → click `paste`.
- First reason inside <think>, then provide <action>, then summarize in <conclusion>.
"""

# -----------------------------
# Parse action
# -----------------------------
def parse_action(action_str: str) -> Tuple[str, Dict[str, Any]]:
    """Parse action string into action type + params."""
    pattern = r"^(\w+)\((.*)\)$"
    match = re.match(pattern, action_str.strip(), re.DOTALL)
    if not match:
        print(f"Invalid action format: {action_str}")
        return "", {}

    action_type, params_str = match.group(1), match.group(2).strip()
    params = {}

    if params_str:
        try:
            # Split on commas that are not inside parentheses.
            # Note: commas inside quoted text, e.g. Type(content='a, b'), are still split.
            param_pairs = re.split(r",(?![^(]*\))", params_str)
            for pair in param_pairs:
                if "=" in pair:
                    key, value = pair.split("=", 1)
                    params[key.strip()] = value.strip().strip("'").strip()
                else:
                    params[pair.strip()] = None
        except Exception as e:
            print(f"Parse param failed: {e}")
            return action_type, {}
    return action_type, params


def extract_tag(content: str, tag: str) -> str:
    """Extract latest <tag>...</tag> content from model output."""
    pattern = fr"<{tag}>(.*?)</{tag}>"
    matches = list(re.finditer(pattern, content, re.DOTALL))
    if not matches:
        print(f"{tag} Not Found")
        return ""
    return matches[-1].group(1).strip()

# -----------------------------
# Inference
# -----------------------------
def inference(image_path: str, goal: str) -> Dict[str, str]:
    """Run one navigation step on a screenshot and return the parsed model output."""
    if not (os.path.exists(image_path) and os.path.isfile(image_path)):
        raise FileNotFoundError(f"Invalid input image path: {image_path}")

    full_prompt = PROMPT_TEMPLATE.format(user_task=goal, previous_actions="")

    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": full_prompt},
            {"type": "image", "image": image_path, "min_pixels": 830000, "max_pixels": 937664},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)

    model_inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(**model_inputs, **GENERATION_CONFIG)
    generated_ids_trimmed = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]

    return {
        "raw_response": output_text,
        "think": extract_tag(output_text, "think"),
        "action": extract_tag(output_text, "action"),
        "conclusion": extract_tag(output_text, "conclusion"),
    }
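
`parse_action` above returns every parameter as a string, so a coordinate such as `box=(540, 1200)` arrives as the text `"(540, 1200)"`. A minimal sketch of turning such a parameter dictionary into typed values is shown below. The helper name `to_typed_params` is hypothetical and not part of the repository, and the sketch assumes coordinates are written as plain integer or decimal pairs.

```python
import re
from typing import Any, Dict

# Matches "(x, y)" with optional sign, decimals, and surrounding whitespace.
_POINT_RE = re.compile(r"^\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)$")


def to_typed_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Convert "(x, y)" strings to float tuples; leave other values unchanged."""
    typed: Dict[str, Any] = {}
    for key, value in params.items():
        match = _POINT_RE.match(value) if isinstance(value, str) else None
        if match:
            typed[key] = (float(match.group(1)), float(match.group(2)))
        else:
            typed[key] = value
    return typed


# Input in the shape parse_action produces for
# "Scroll(start=(540, 1600), end=(540, 800), direction='up')".
example = {"start": "(540, 1600)", "end": "(540, 800)", "direction": "up"}
typed_example = to_typed_params(example)
```

The coordinates produced this way are still in the model's coordinate space; see the warning in the Usage section about mapping them back to the original image.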

Usage
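
The Quick Start `inference` function always passes an empty `previous_actions` string, which covers only the first step of a task. For multi-step navigation, earlier steps can be rendered into that placeholder. The sketch below shows one possible way to do this. The `Step N: ...` line format and the `format_history` helper are assumptions made for illustration, not necessarily the format used during training, so check the GitHub repository for the exact convention.

```python
from typing import Dict, List


def format_history(steps: List[Dict[str, str]]) -> str:
    """Render earlier steps as numbered lines for the {previous_actions} slot.

    Each step is a dict shaped like the return value of `inference`; only the
    "action" and "conclusion" fields are used (a hypothetical format).
    """
    lines = []
    for index, step in enumerate(steps, start=1):
        line = f"Step {index}: {step.get('action', '')}"
        conclusion = step.get("conclusion", "")
        if conclusion:
            line += f" | {conclusion}"
        lines.append(line)
    return "\n".join(lines)


history = [
    {"action": "Launch(app='Settings')", "conclusion": "Opened the Settings app."},
    {"action": "Click(box=(540, 1200))", "conclusion": "Tapped the Wi-Fi entry."},
]
previous_actions = format_history(history)
```

The resulting string would then be passed as `previous_actions` in the `PROMPT_TEMPLATE.format(...)` call instead of the empty string.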

⚠️ For action types that include coordinates (e.g., Click, Scroll), the Quick Start code does not handle coordinate conversion. Because the screenshot is resized according to min_pixels and max_pixels before inference, you need to map the predicted coordinates back to the original image space before applying them.
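
One way to perform this mapping is sketched below. It assumes that the model emits coordinates in the pixel space of the resized image, and that the resize follows the `smart_resize` logic of `qwen_vl_utils` (dimensions rounded to multiples of 28 and scaled to fit between `min_pixels` and `max_pixels`). The function `resized_size` is a simplified re-implementation written for this example (it omits the aspect-ratio check of the original), and `to_original_coords` is a hypothetical helper. Verify both against the GitHub repository before relying on them.

```python
import math
from typing import Tuple

MIN_PIXELS = 830000   # values used in the Quick Start message
MAX_PIXELS = 937664
FACTOR = 28           # assumed size alignment used by Qwen2.5-VL preprocessing


def resized_size(height: int, width: int, factor: int = FACTOR,
                 min_pixels: int = MIN_PIXELS,
                 max_pixels: int = MAX_PIXELS) -> Tuple[int, int]:
    """Return (height, width) of the image after resizing for the model."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides, rounding down to a multiple of factor.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides, rounding up to a multiple of factor.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def to_original_coords(x: float, y: float, orig_width: int,
                       orig_height: int) -> Tuple[float, float]:
    """Scale a point from resized-image pixels to original-image pixels."""
    resized_h, resized_w = resized_size(orig_height, orig_width)
    return x * orig_width / resized_w, y * orig_height / resized_h


# A 1080x2400 screenshot: map a predicted point back to the original image.
original_point = to_original_coords(322, 714, orig_width=1080, orig_height=2400)
```

The original width and height can be read from the screenshot itself, for example with `PIL.Image.open(image_path).size`.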


Results on AndroidWorld

The validation trajectories for AndroidWorld, including execution logs and navigation paths, are available as a compressed package.
📥 Download: UI-Venus-androidworld.zip

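| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
|---|---|---|---|---|
| Closed-source Models | | | | |
| GPT-4o | | | | 30.6 |
| ScaleTrack | | | | 44.0 |
| SeedVL-1.5 | | | | 62.1 |
| UI-TARS-1.5 | | | | 64.2 |
| Open-source Models | | | | |
| GUI-Critic-R1-7B | | | | 27.6 |
| Qwen2.5-VL-72B* | | | | 35.0 |
| UGround | | | | 44.0 |
| Aria-UI | | | | 44.8 |
| UI-TARS-72B | | | | 46.6 |
| GLM-4.5v | | | | 57.0 |
| Ours | | | | |
| UI-Venus-Navi-7B | | | | 49.1 |
| UI-Venus-Navi-72B | | | | 65.9 |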

Table: Performance comparison on AndroidWorld for end-to-end models. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.

Results on AndroidControl and GUI-Odyssey

| Models | AndroidControl-Low Type Acc. | AndroidControl-Low Step SR | AndroidControl-High Type Acc. | AndroidControl-High Step SR | GUI-Odyssey Type Acc. | GUI-Odyssey Step SR |
|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | |
| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
| Open-source Models | | | | | | |
| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
| Aguvis-7B | - | 80.5 | - | 61.5 | - | - |
| Aguvis-72B | - | 84.4 | - | 66.4 | - | - |
| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
| UI-TARS-7B | 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
| UI-TARS-72B | 98.1 | 91.3 | 85.2 | 74.7 | 95.4 | 88.6 |
| GUI-R1-7B | 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
| Ours | | | | | | |
| UI-Venus-Navi-7B | 97.1 | 92.4 | 86.5 | 76.1 | 87.3 | 71.5 |
| UI-Venus-Navi-72B | 96.7 | 92.9 | 85.9 | 77.2 | 87.2 | 72.4 |

Table: Performance comparison on offline UI navigation datasets including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.

Citation

Please consider citing if you find our work useful:

@misc{gu2025uivenustechnicalreportbuilding,
      title={UI-Venus Technical Report: Building High-performance UI Agents with RFT}, 
      author={Zhangxuan Gu and Zhengwen Zeng and Zhenyu Xu and Xingran Zhou and Shuheng Shen and Yunfei Liu and Beitong Zhou and Changhua Meng and Tianyu Xia and Weizhi Chen and Yue Wen and Jingya Dou and Fei Tang and Jinzhen Lin and Yulin Liu and Zhenlin Guo and Yichen Gong and Heng Jia and Changlong Gao and Yuan Guo and Yong Deng and Zhenyu Guo and Liang Chen and Weiqiang Wang},
      year={2025},
      eprint={2508.10833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.10833}, 
}