DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.
hub Canonical reference
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Canonical reference. 78% of citing Pith papers cite this work as background.
abstract
Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promises of GUI agents that navigate the digital world as humans do.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
GUI-C² pairs a difficulty-scoring data pipeline with an area-gated coarse-to-fine RL mechanism to improve GUI grounding accuracy and training stability.
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
citing papers explorer
-
Large Language Model-Brained GUI Agents: A Survey
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.