{"total":11,"items":[{"citing_arxiv_id":"2607.00333","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents","primary_cat":"cs.CR","submitted_at":"2026-07-01T02:17:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Identifies Screen Perception and Misused Channel attack surfaces in VLM-powered mobile agents and demonstrates seven attacks enabling arbitrary command execution on five frameworks without privileges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18652","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:57:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07110","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability","primary_cat":"cs.CL","submitted_at":"2026-05-08T01:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and 
alignment interventions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 illustrate how file operations, application macros, shell commands, or code execution can bypass brittle GUI sequences in complex workflows [45], [46]. GUI-360, MAI-UI, and Step-GUI add complementary evidence that hybrid GUI+API or GUI+MCP action spaces are a recurring part of the current CUA design landscape rather than isolated systems choices [4], [5], [68]. The gain is speed, auditability, and leverage. The trade-off is stronger authority binding: one invocation can modify files, alter configurations, or contact external systems directly. Bundled or macro-action execution sits between those extremes. AppAgentX is a direct example because it evolves recurrent action sequences into higher-level routines that"},{"citing_arxiv_id":"2604.17817","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots","primary_cat":"cs.HC","submitted_at":"2026-04-20T05:15:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ©2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. Manuscript submitted to ACM Manuscript submitted to ACM 1 arXiv:2604.17817v1 [cs.HC] 20 Apr 2026 2 Zhang et al. 
with Siri [6]; personalized assistance that makes recommendations based on user preferences and context [24]; and intent interpretation for tasks like booking a restaurant [20]. Powered by recent breakthroughs in Large Language Models (LLMs) like ChatGPT [39] and Claude [5], mobile agents now leverage screenshots as input and use LLMs as the \"brain\" to analyze, predict, and execute complex tasks [4, 20, 40]. Focusing on autonomous systems capable of operating within mobile environments, these agents integrate multimodal"},{"citing_arxiv_id":"2604.14113","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:32:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"crop when candidates disagree on position;vintra ensures the crop is at least as large as the predicted element even when candidates coincide. 5 3.4.3 Crop window. We set the crop radius asr=γσ, where σ =√vinter +v intra. To avoid degenerate crops and aspect-ratio distortions, we impose a minimum side lengthmand squarify: s= max(2r x,2r y,m),[x c 1,y c 1,x c 2,y c 2] = [µx−s 2, µy−s 2, µx + s 2, µy + s 2].(10) If the window extends beyond image boundaries, we shift it inward while preserving its size. 3.4.4 Zoom and map back. We cropI to this window, resize it to the model's resolution budget, and run a single deterministic pass (T=0) to obtain a refined boxˆbin crop coordinates. 
We map it back to global normalized coordinates via: x= xc 1 + ˆxwc W , y= yc"},{"citing_arxiv_id":"2604.13531","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management","primary_cat":"cs.AI","submitted_at":"2026-04-15T06:27:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025. [17] Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, and Sunjae Lee. Modular and multi-path-aware offline benchmarking for mobile gui agents. arXiv preprint arXiv:2512.12634, 2025. [18] Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Appagentx: Evolving gui agents as proficient smartphone users. arXiv preprint arXiv:2503.02268, 2025. [19] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. 
Omniact: A dataset and benchmark for enabling multimodal"},{"citing_arxiv_id":"2604.11259","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization","primary_cat":"cs.AI","submitted_at":"2026-04-13T10:12:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"adjusting system settings, managing emails, and completing shopping or service workflows [16, 27, 39]. In recent years, Multimodal Large Language Models (MLLMs) have enabled mobile GUI agents to perform complex tasks on mobile devices [12, 30]. The advancement of mobile GUI agents further improved task completion rate on these realistic smartphone tasks [9, 26, 36]. They are moving beyond proof-of-concept demonstrations toward practical assistants that can act on behalf of users in daily routines. However, from an end-user's perspective, task completion alone does not necessarily imply user satisfaction [11, 29]. 
Users care not only about whether a task is completed, but also about how it is completed and what risks are incurred along the way, e."},{"citing_arxiv_id":"2509.21982","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RISK: A Framework for GUI Agents in E-commerce Risk Management","primary_cat":"cs.AI","submitted_at":"2025-09-26T07:05:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.06477","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents","primary_cat":"cs.AI","submitted_at":"2025-09-08T09:43:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.19500","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration","primary_cat":"cs.AI","submitted_at":"2025-06-24T10:39:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NaviAgent decouples task planning from tool execution via a Tool World Navigation Model graph to improve scalability and success 
rates in LLM agents handling large tool ecosystems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.18279","ref_index":266,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In addition to multimodal integration, some frameworks focus on parsing intricate web structures and generating executable code to navigate complex websites. WebAgent [267] employs a two-tiered model approach by combining HTML-T5 for parsing long, complex HTML documents with Flan-U- JOURNAL OF LATEX CLASS FILES, DECEMBER 2024 29 Foundations to Innovations Frameworks Web [17], [199] [266]-[297] Mobile [18], [158] [156], [298] [258], [299]-[325] Computer [19], [161] [162], [304] [326]-[335] Cross- Platform [236], [336] [249], [337] [338]-[346] Data Web [212][347]-[351] Mobile [144], [298] [352]-[365] Computer [366]-[368] Cross- Platform [184] [216], [219][369]-[377] Models Foundation Models [231], [234] [165], [206], [210] [92], [93], [163]"}],"limit":50,"offset":0}