{"total":14,"items":[{"citing_arxiv_id":"2605.29486","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhoneWorld: Scaling Phone-Use Agent Environments","primary_cat":"cs.CL","submitted_at":"2026-05-28T07:14:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhoneWorld is a pipeline that converts real mobile trajectories into scalable controllable environments, yielding large gains on four benchmarks when used to supplement training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28775","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents","primary_cat":"cs.LG","submitted_at":"2026-05-27T17:37:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19769","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenComputer: Verifiable Software Worlds for Computer-Use Agents","primary_cat":"cs.AI","submitted_at":"2026-05-19T12:40:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenComputer introduces a verifier-grounded framework with state verifiers, self-evolving layers, task synthesis, and auditable evaluation for 33 desktop apps and 1000 tasks to support computer-use AI 
agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19260","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees","primary_cat":"cs.AI","submitted_at":"2026-05-19T02:13:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18652","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:57:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12501","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer 
interactions.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"OS-Atlas-Base-7B∗[11] 2024-10 18.9 9.0 9.9 15.8 0.0 12.5 10.9 0.0 7.8 InfiGUI-R1-3B [18] 2025-04 45.2 22.0 23.2 23.7 3.1 9.4 7.8 0.0 8.8 UI-Venus-Ground-7B [19] 2025-08 50.8 26.5 24.3 23.7 3.1 18.8 9.4 0.0 11.0 GUI-G2-7B [20] 2025-07 47.5 26.4 21.1 23.7 6.2 15.6 7.8 4.8 11.6 MAI-UI-2B†[22] 2025-12 57.4 30.3 27.1 18.4 3.1 18.8 12.5 9.5 12.5 GUI-Owl-1.5-8B-Think [23] 2026-02 57.6 33.2 24.4 23.7 9.4 18.8 10.9 7.1 14.0 MAI-UI-8B†[22] 2025-12 65.8 40.7 25.1 26.3 18.8 18.8 7.8 4.8 15.3 GUI-Owl-1.5-8B-Instruct [23] 2026-02 71.1 37.4 33.7 23.7 15.6 18.8 9.4 9.5 15.4 UI-Venus-Ground-72B [19] 2025-08 61.9 36.8 25.1 28.9 18.8 18.8 10.9 9.5 17.4 InfiGUI-G1-7B [21] 2025-08 51.9 26.1 25.8 44.7 18.8 37.5 9.4 4.8 23.0 EvoCUA-8B [26] 2026-01 45."},{"citing_arxiv_id":"2605.12481","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents","primary_cat":"cs.AI","submitted_at":"2026-05-12T17:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"domain, which we will save for OOD verification. Please reference Appendix C.3 for more details. Baselines and Benchmark. We evaluate ToolCUA against two categories: general-purpose foundation models (e.g., Qwen3.5-Plus [29], Claude-4.5-Sonnet [2], Gemini-3.1-Pro [10] and specialized CUAs including UI-Tars-1.5 [28], the EvoCUA series [48], and GUI-Owl-1.5 [46]. 
For evaluation, we utilize OSWorld-MCP [12] as our primary benchmark, as it is designed for CUAs under a hybrid action space, which covers typical GUI actions, 150+ tools, and mainstream desktop apps. Following the benchmark setup, we report results on the feasible tasks only. To mitigate environmental stochasticity in the sandbox, we report the average@3 for all primary metrics, and set the maximum steps per task to"},{"citing_arxiv_id":"2605.10347","ref_index":26,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Mobile World Model Guides GUI Agents?","primary_cat":"cs.AI","submitted_at":"2026-05-11T10:49:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"World models trained on delta text, full text, diffusion images, and renderable code achieve SoTA on two benchmarks and improve downstream GUI agent performance on three mobile datasets with modality-specific strengths.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"a much higher Time cost than VLMs. Therefore, the additional image modeling from diffusion models is insufficient to compensate for shortcomings in text processing and rendering efficiency. 3.3 How Do Modalities Affect Task Completion? Online End-to-End Evaluation Settings. We evaluate world-model guidance on AndroidWorld [25] using the M3A framework and Mobile-Agent-v3.5 [26] (MA3.5). Each agent samples k = 3 candidate actions; the world model predicts the next states; and Gemini 3-Flash scores the candidates and selects the highest to execute. Benchmark and Metrics. We employ the official AndroidWorld benchmark, which categorizes unseen application tasks into three distinct difficulty levels: (1) Easy and Medium tasks serve"},{"citing_arxiv_id":"2605.12549","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Happens Before Decoding? 
Prefill Determines GUI Grounding in VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-10T07:04:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07630","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T11:58:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07110","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability","primary_cat":"cs.CL","submitted_at":"2026-05-08T01:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Architecture-lifecycle view General-purpose CUAs across web, desktop, 
mobile, and cross-application settings An integrated architecture-lifecycle account of capability, risk, privacy, and control Clarifies failure origin versus manifestation, maps risk patterns to intervention points, and outlines deployable control surfaces for live-use CUA settings [3], [45], [46]. Even with screenshots, DOM or accessibility trees, OCR output, memory traces, and tool responses, relevant state may remain hidden, delayed, or already stale: windows change asynchronously, permission prompts appear after a plan is formed, and tool state may become visible only after an invocation succeeds [9], [17], [47]. Three properties follow."},{"citing_arxiv_id":"2604.25380","ref_index":39,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarking and Improving GUI Agents in High-Dynamic Environments","primary_cat":"cs.CV","submitted_at":"2026-04-28T08:43:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tion, and instruction following, as exemplified by ShowUI [16], CogAgent [11], HATS [28], SeeClick [4], and SimpAgent [2]. RL-based approaches, such as UI-AGILE [15] and GUI-R1 [21], improve decision-making through interaction feedback and reward-driven optimization. Beyond single-stage training, multi-stage or modular pipelines such as GTA1 [42], Aguvis [39], and related planner-executor systems introduce reasoning, grounding, and verification modules to enhance execution in complex environments. 
At the same time, another important direction explores stronger foundation backbones for GUI control, including general-purpose VLMs and native GUI-action models such as Qwen-VL [1], OpenCUA [35], GUI-Owl [38], UIPro [14], and UI-TARS [23]."},{"citing_arxiv_id":"2604.13531","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management","primary_cat":"cs.AI","submitted_at":"2026-04-15T06:27:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2505.13227, 2025. [52] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040-52094, 2024. [53] Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026. 
[54] Kevin Xu, Yeganeh Kordi, Tanay Nayak, Adi Asija, Yizhong Wang, Kate Sanders, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, and Daniel Khashabi."},{"citing_arxiv_id":"2604.11259","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization","primary_cat":"cs.AI","submitted_at":"2026-04-13T10:12:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the rapid development of Mobile GUI agents and mobile interaction benchmarks. [22, 28, 31] Early systems such as AppAgent [40] and AppAgent-v2 [14] demonstrate the feasibility of autonomous mobile app operation, while benchmarks such as AndroidWorld [26], GUIOdyssey [20], and SPA-Bench [3] make evaluation more realistic and systematic. More recent agent systems, including UI-TARS [24] and Mobile-Agent-v3 [37], further push this direction toward stronger grounding, longer-horizon execution, and more practical deployment. Recent studies have also started to examine privacy in mobile agent settings, shifting attention from general task execution to privacy-related risks and protections. However, existing work still mainly focuses on task success, privacy awareness [15], or infor-"}],"limit":50,"offset":0}