super hub Mixed citations

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Mixed citation behavior. Most common role is background (50%).

127 Pith papers citing it
Background 50% of classified citations
abstract

This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, which leverages a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

citation-role summary

background 16 · baseline 12 · dataset 1 · method 1

representative citing papers

ProCUA-SFT Technical Report

cs.LG · 2026-06-15 · conditional · novelty 7.0

ProCUA-SFT is a 3.1M-sample SFT dataset built from 93K verified synthetic trajectories that lifts the UI-TARS 7B OSWorld score from 26.3% to 45%.

A History-Aware Visually Grounded Critic for Computer Use Agents

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.

HLL: Can Agents Cross Humanity's Last Line of Verification?

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

MMSkills: Towards Multimodal Skills for General Visual Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 3 refs

MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
