MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
35 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (35)
citing papers explorer
-
MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization
MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld using only automatically generated annotation-free data via MobileGym and HiFPO, with ForgeOwl-8B reaching 77.6%.
-
DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions
DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.
-
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
-
ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
-
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
-
FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems
FedGUI is the first comprehensive benchmark for federated GUI agents that studies cross-platform, cross-device, cross-OS, and cross-source heterogeneity, with experiments showing performance gains from cross-platform collaboration and identifying platform and OS as the most influential factors.
-
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
Android Coach improves online agent training efficiency by enabling multiple actions per state via a critic-based coach, process reward model, and group-wise advantage estimation, delivering 7.5-8.3% success rate gains and 1.4x efficiency over PPO/GRPO baselines.
-
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
-
What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States
Introduces Active Task-Driving Memory (ATMem) and STR-GRPO to move GUI agents from passive record storage to actively maintained task states, tested on a new mobile benchmark with progress- and scope-aware metrics.
-
One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding
InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.
-
GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots
GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.
-
PhoneBuddy: Training Open Models for Agentic Phone Use
PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.
-
MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management
MemGUI-Agent uses Context-as-Action (ConAct) for proactive context management in long-horizon GUI tasks, trained on the MemGUI-3K dataset to achieve top 8B-model results on MemGUI-Bench and MobileWorld.
-
Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?
CLI-based coding agents outperform GUI baselines on AndroidWorld and MobileWorld, with oracles reaching 88.8% and 86.3% solvability and a new CLI-Advantage suite showing CLI superiority in bulk operations, filtering, aggregation, cross-app workflows, and hidden state.
-
UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents
UI-KOBE constructs reusable app knowledge graphs from autonomous exploration to provide runtime guidance that improves lightweight mobile GUI agents.
-
MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
MobileExplorer reduces on-device GUI agent reasoning steps and latency by 23% via parallel UI exploration, structured memory, and a two-level rollback while maintaining or improving task success rates.
-
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
-
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
-
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
-
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.
-
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
-
AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning
AliyunConsoleAgent-32B reaches 63.52% success on a 278-task cloud console benchmark, closing to within 1.82 pp of frontier models at 92% lower cost via SFT distillation and GRPO RL.
-
StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents
StainFlow proposes global entity stain tracking and local stain evidence linking modules to improve process rewards for GUI agents, reporting a 3.2% relative gain in online RL success and a 1.8% gain in judgment accuracy on AndroidWorld and OGRBench.
-
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
MIRAGE compresses explicit chain-of-thought into latent vectors and adds a generative world model to predict future interface states, matching explicit reasoning performance with 3-5x fewer tokens on Android benchmarks.
-
Agent Skills Should Go Beyond Text: The Case for Visual Skills
The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.
-
STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments
STAMP trains explicit memory for mobile GUI agents via virtual environments with controlled memory injection, achieving SOTA on the new Memory-World benchmark.
-
SE-GA: Memory-Augmented Self-Evolution for GUI Agents
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
-
Xiaomi-GUI-0 Technical Report
Xiaomi-GUI-0 reports 72.0% success on RealMobile and 78.9% on AndroidWorld via real-device closed-loop training with multi-source data and a three-stage RL pipeline.
-
How Mobile World Model Guides GUI Agents?
World models trained on delta text, full text, diffusion images, and renderable code achieve SoTA on two benchmarks and improve downstream GUI agent performance on three mobile datasets with modality-specific strengths.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
-
ClawMobile: Rethinking Smartphone-Native Agentic Systems
ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.