Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
hub
Mai-ui technical report: Real-world centric foundation gui agents
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 10roles
background 1polarities
background 1representative citing papers
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
FedGUI is the first comprehensive benchmark for federated GUI agents that studies cross-platform, cross-device, cross-OS, and cross-source heterogeneity, with experiments showing performance gains from cross-platform collaboration and identifying platform and OS as the most influential factors.
Android Coach improves online agent training efficiency by enabling multiple actions per state via a critic-based coach, process reward model, and group-wise advantage estimation, delivering 7.5-8.3% success rate gains and 1.4x efficiency over PPO/GRPO baselines.
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out-of-distribution use.
Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
citing papers explorer
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
-
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
-
FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems
FedGUI is the first comprehensive benchmark for federated GUI agents that studies cross-platform, cross-device, cross-OS, and cross-source heterogeneity, with experiments showing performance gains from cross-platform collaboration and identifying platform and OS as the most influential factors.
-
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
Android Coach improves online agent training efficiency by enabling multiple actions per state via a critic-based coach, process reward model, and group-wise advantage estimation, delivering 7.5-8.3% success rate gains and 1.4x efficiency over PPO/GRPO baselines.
-
How Mobile World Model Guides GUI Agents?
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out-of-distribution use.
-
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.
-
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.