{"total":10,"items":[{"citing_arxiv_id":"2605.17656","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding","primary_cat":"cs.HC","submitted_at":"2026-05-17T21:14:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MUIAnno is an expert-annotated dataset of mobile UI screens from iOS apps with structured JSON labels and baseline results for UI element detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07110","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability","primary_cat":"cs.CL","submitted_at":"2026-05-08T01:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"instrumented, the same structural advantage can disappear. Parser-augmented visual agents recover part of that structure from rendered interfaces. Systems such as OmniParser, ScreenAI, SeeClick, Ferret-UI, TRISHUL, and newer complete-screen parsing approaches add OCR, icon captioning, region proposals, or layout parsing before downstream reasoning [50], [51], [60], [71], [73], [74]. This family is attractive because it retains visual generality while recovering some of the handles that make execution easier. 
Its main failure mode is familiar: the parser becomes a bottleneck, and downstream reasoning can remain confidently wrong when the parsed state is incomplete or distorted. Native end-to-end visual agents push farther toward porta-"},{"citing_arxiv_id":"2604.28001","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Pattern Language for Resilient Visual Agents","primary_cat":"cs.AI","submitted_at":"2026-04-30T15:24:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes four architectural patterns—Hybrid Affordance Integration, Adaptive Visual Anchoring, Visual Hierarchy Synthesis, and Semantic Scene Graph—to balance non-determinism and latency of foundation models with enterprise requirements for determinism and real-time performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23772","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PageGuide: Browser extension to assist users in navigating a webpage and locating information","primary_cat":"cs.HC","submitted_at":"2026-04-26T15:49:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user study.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Query: Which stream emerges from Nevado Mismi before eventually contributing to the Apurimac River system? The stream that emerges from Nevado Mismi[1] is Quebrada Carhuasanta[2]. 
It joins Quebrada Apacheta[3] to form the Río Lloqueta[4], which then becomes the Río Hornillos[5] before eventually joining the Río Apurímac[6]. ✨(6 highlighted) The most distant source of the Amazon was thought to be in the Apurímac river drainage for nearly a century. Such studies continued to be published even as recently as 1996,[63] 2001,[64] 2007[25] and 2008,[65] where various authors identified the snowcapped 5,597 m (18,363 ft) Nevado Mismi peak, located roughly 160 km (99 mi) west of Lake Titicaca and 700 km (430 mi) southeast of Lima, as the most distant source of the river."},{"citing_arxiv_id":"2604.21375","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","primary_cat":"cs.CL","submitted_at":"2026-04-23T07:42:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Initial results across these benchmarks consistently fall far behind human experts, revealing failure patterns that motivate the design of VLAA-GUI. 2.2 GUI Agents: Models and Frameworks. End-to-end models trained for GUI interaction-UI-TARS [49], AGUVIS [68], ShowUI [40], CogAgent [33], OS-Atlas [66], among others [65,74]-achieve strong grounding without HTML or accessibility trees. Screen-based agents [9,45,56, 62] further explore pixel-space control, and web agents [30,32,81] investigate long-horizon decision making in browser environments. 
Frontier providers have followed with commercial APIs: Claude Computer Use [4], OpenAI CUA [47], and Seed [12], the latter serving as our dedicated grounding model. Complementary modular frameworks compose MLLMs with planning, mem-"},{"citing_arxiv_id":"2604.08516","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MolmoWeb: Open Visual Web Agent and Open Data for the Open Web","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"semanticscholar.org/CorpusID:271601072. [58] Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, and Xiang Bai. Omniparser v2: Structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models. ArXiv, abs/2502.16161, 2025. URL https://api.semanticscholar.org/CorpusID:276575751. [59] Tianlin Shi, Andrej Karpathy, Linxi (Jim) Fan, Josefa Z. Hernández, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, 2017. URL https://api.semanticscholar.org/CorpusID:34953552. [60] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 
Reinforcement learning on web"},{"citing_arxiv_id":"2602.10139","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible","primary_cat":"cs.CR","submitted_at":"2026-02-08T15:50:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01785","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding","primary_cat":"cs.CL","submitted_at":"2026-02-02T08:10:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"code, datasets, and reproduction scripts. [8] Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. ScreenAI: A Vision-Language Model for UI and Visually-Situated Language Understanding. arXiv:2402.04615 [cs.CV] https://arxiv.org/abs/2402.04615 [9] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis. 
arXiv:1904.01906 [cs.CV] https://arxiv.org/abs/1904.01906 [10] Shuai Bai, Jinze Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou."},{"citing_arxiv_id":"2509.06477","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents","primary_cat":"cs.AI","submitted_at":"2025-09-08T09:43:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"source [9, 15, 27, 37, 64, 46, 62, 66]. However, source code often tends to be verbose, non-intuitive, and filled with noise. In many cases, it is even inaccessible or unavailable for use, making multi-modality or even vision-only perception a must. To take screenshots as input, there are already specialized, optimized multi-modal models available that are suited for tasks on web [4, 12, 18, 23, 43] and mobile devices [17, 63]. 
Additionally, general-purpose foundation models [5, 26, 31, 67] also demonstrate significant potential for multi-modal digital agents. The development of prompt-based methods [13, 16, 55, 65], as well as visual reasoning paradigms, have also further facilitated the performance of digital agents in web pages, mobile apps, and desktop."}],"limit":50,"offset":0}