MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi · 2025 · arXiv 2512.22047

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 2 background 1

citation-polarity summary

baseline 2 background 1

representative citing papers

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

cs.CV · 2026-05-10 · conditional · novelty 7.0

GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.

OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems

cs.MA · 2026-04-16 · unverdicted · novelty 7.0

FedGUI is the first comprehensive benchmark for federated GUI agents that studies cross-platform, cross-device, cross-OS, and cross-source heterogeneity, with experiments showing performance gains from cross-platform collaboration and identifying platform and OS as the most influential factors.

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

cs.LG · 2026-04-08 · unverdicted · novelty 7.0 · 2 refs

Android Coach improves online agent training efficiency by enabling multiple actions per state via a critic-based coach, process reward model, and group-wise advantage estimation, delivering 7.5-8.3% success rate gains and 1.4x efficiency over PPO/GRPO baselines.

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

cs.SE · 2026-03-14 · unverdicted · novelty 7.0

VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

cs.CV · 2026-05-18 · conditional · novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.

Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

cs.MM · 2026-04-10 · unverdicted · novelty 6.0

Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.

Agent Skills Should Go Beyond Text: The Case for Visual Skills

cs.CV · 2026-05-31 · unverdicted · novelty 5.0

The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

cs.LG · 2026-05-16 · unverdicted · novelty 5.0

SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

cs.AI · 2026-04-30 · unverdicted · novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.

How Mobile World Model Guides GUI Agents?

cs.AI · 2026-05-11 · unverdicted · novelty 4.0 · 2 refs

World models trained on delta text, full text, diffusion images, and renderable code achieve SoTA on two benchmarks and improve downstream GUI agent performance on three mobile datasets with modality-specific strengths.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.

ClawMobile: Rethinking Smartphone-Native Agentic Systems

cs.MA · 2026-02-26 · unverdicted · novelty 4.0

ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.

citing papers explorer

Showing 19 of 19 citing papers.

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions cs.AI · 2026-06-04 · unverdicted · none · ref 1
DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment cs.LG · 2026-05-14 · unverdicted · none · ref 131 · 2 links
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 22
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs cs.CV · 2026-05-10 · conditional · none · ref 6
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents cs.CL · 2026-04-27 · unverdicted · none · ref 96
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems cs.MA · 2026-04-16 · unverdicted · none · ref 5
FedGUI is the first comprehensive benchmark for federated GUI agents that studies cross-platform, cross-device, cross-OS, and cross-source heterogeneity, with experiments showing performance gains from cross-platform collaboration and identifying platform and OS as the most influential factors.
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions cs.LG · 2026-04-08 · unverdicted · none · ref 5 · 2 links
Android Coach improves online agent training efficiency by enabling multiple actions per state via a critic-based coach, process reward model, and group-wise advantage estimation, delivering 7.5-8.3% success rate gains and 1.4x efficiency over PPO/GRPO baselines.
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging cs.SE · 2026-03-14 · unverdicted · none · ref 41
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees cs.AI · 2026-05-19 · unverdicted · none · ref 37
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents cs.CV · 2026-05-18 · conditional · none · ref 85
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
DocOS: Towards Proactive Document-Guided Actions in GUI Agents cs.AI · 2026-05-18 · unverdicted · none · ref 94
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents cs.CL · 2026-05-08 · unverdicted · none · ref 16
Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation cs.MM · 2026-04-10 · unverdicted · none · ref 50
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
Agent Skills Should Go Beyond Text: The Case for Visual Skills cs.CV · 2026-05-31 · unverdicted · none · ref 47
The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.
SE-GA: Memory-Augmented Self-Evolution for GUI Agents cs.LG · 2026-05-16 · unverdicted · none · ref 54
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants cs.AI · 2026-04-30 · unverdicted · none · ref 86
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
How Mobile World Model Guides GUI Agents? cs.AI · 2026-05-11 · unverdicted · none · ref 24 · 2 links
World models trained on delta text, full text, diffusion images, and renderable code achieve SoTA on two benchmarks and improve downstream GUI agent performance on three mobile datasets with modality-specific strengths.
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability cs.CL · 2026-05-08 · unverdicted · none · ref 4
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
ClawMobile: Rethinking Smartphone-Native Agentic Systems cs.MA · 2026-02-26 · unverdicted · none · ref 26
ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.
