{"total":17,"items":[{"citing_arxiv_id":"2605.27761","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications","primary_cat":"cs.CV","submitted_at":"2026-05-26T23:19:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16402","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments","primary_cat":"cs.CV","submitted_at":"2026-05-13T02:48:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06365","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","primary_cat":"cs.AI","submitted_at":"2026-05-07T14:39:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and 
contamination.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"to-Most prompting [8], and scratchpad-style methods [5] all show gains from exposing intermediate computation. Search-based extensions such as Tree-of-Thoughts [15], Language Agent Tree Search [18], Reflexion [16], Self-Refine [17], and structured reflection [19] push farther by revising and exploring alternative reasoning trajectories; earlier work on workflow-guided exploration [20] foreshadows the same interest in reusable action structure. These papers matter here for two reasons. First, they show that intermediate steps are often where performance gains come from. Second, they reveal the limitation of prompt-centric intermediate state: most still represent intermediate work as textual traces tied to a specific execution episode."},{"citing_arxiv_id":"2604.21375","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","primary_cat":"cs.CL","submitted_at":"2026-04-23T07:42:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"cent benchmarks target specific domains or platforms, such as Spider2-V [13] for enterprise data-science workflows, ScreenSpot [18] for visual grounding, and macOSWorld [72] for macOS-specific tasks. Parallel efforts extend evaluation to mobile [15,21,50,51] and web settings [19,22-24,34,43,54,75,78,83], building upon classic web-interaction benchmarks [41,46,52]. 
Beyond task-completion benchmarks, recent work evaluates multimodal model robustness and reliability more broadly, including safety and attribute evaluations under out-of-distribution visual inputs [17,37,58], vision-language reward and reinforce learning [16,59]. Initial results across these benchmarks consistently fall far behind human ex-"},{"citing_arxiv_id":"2604.08516","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MolmoWeb: Open Visual Web Agent and Open Data for the Open Web","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"screenshots, and parsing screenshots into structured representations [56-58]. While we use GUI grounding and QA data as sources of auxiliary supervision, our primary goal is to learn to follow instructions to solve tasks on the web. Evaluation of web agents. Evaluating web agents is challenging. Early evaluation work focused on sandboxed web environments [7, 59-62], desktop environments [63], and multi-turn dialogue navigation datasets [64] where the answer is known or verifiable using oracle knowledge of environment state. Recently, several 13 benchmarks have proposed evaluating on live websites. 
While some use automatic verifiers [65, 66] or simple text answers that are unlikely to change over time [67], other use a VLM-as-a-judge to verify task completion"},{"citing_arxiv_id":"2604.06126","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gym-Anything: Turn any Software into an Agent Environment","primary_cat":"cs.LG","submitted_at":"2026-04-07T17:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improves small VLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.05295","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces","primary_cat":"cs.AI","submitted_at":"2026-03-05T15:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.05044","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents","primary_cat":"cs.AI","submitted_at":"2026-03-05T10:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebFactory is a fully automated RL pipeline that compresses 
LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.00933","ref_index":4,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers","primary_cat":"cs.SE","submitted_at":"2026-01-31T23:19:39+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10371","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management","primary_cat":"cs.AI","submitted_at":"2025-12-11T07:37:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22074","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Real-Time Procedural Learning From Experience for AI Agents","primary_cat":"cs.AI","submitted_at":"2025-11-27T03:51:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRAXIS enables AI agents to 
acquire procedural knowledge in real time by indexing and retrieving state-action-result experiences, leading to better accuracy, reliability, and efficiency on web browsing benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23883","ref_index":230,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges","primary_cat":"cs.AI","submitted_at":"2025-10-27T21:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Web - -✓ ✓ Rule- based End-state Unified interface MiniWoB++ [224] Web - - -✓ Rule- based End-state Baseline τ-bench [227] Multi- domain - -✓ ✓ Rule- based pass∧k, pass@k Consistency GTA [228] Multi- domain -✓-✓ Rule- based End-state, step-wise Tool agent OSWorld [229] Computer -✓-✓ Rule- based End-state, Scripted Computer environment OSWorld-Human [230] Computer -✓-✓ Rule- based Weighted Efficiency Score Temporal performance Security-Specific Benchmarks ST-WebAgent- Bench [231] Enterprise✓ ✓ ✓ ✓- CuP + Risk Policy com- pliance AgentHarm [46] Open- domain ✓-✓ ✓- Compliance Jailbreak + competence OS-Harm [232] Computer✓ ✓ ✓ ✓LLM Safety + accuracy Desktop ef- fects R-Judge [233] Meta-eval✓ ✓∼- LLM Risk recog-"},{"citing_arxiv_id":"2411.18279","ref_index":149,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model-Brained GUI Agents: A 
Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[145] (2017) enabled the training of web-based agents using reinforcement learning (RL). Workflow-guided exploration [146] (2018) improved RL efficiency and task performance. DQT [147] (2024) applied deep reinforcement learning to automate Android GUI testing by preserving widget structures and semantics, while AndroidEnv [148] (2021) offered realistic simulations for agent training on Android. WebShop [149] (2022) illustrated the potential for large-scale web interaction, underscoring the growing sophistication of RL-driven GUI automation. While these machine learning-based approaches were more adaptable than earlier rule-based systems [150], [151], they still struggled to generalize across diverse, unforeseen tasks. 
Their dependence on predefined workflows and limited"},{"citing_arxiv_id":"2407.17032","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gymnasium: A Standard Interface for Reinforcement Learning Environments","primary_cat":"cs.LG","submitted_at":"2024-07-24T06:35:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3 Observation Space The observation space in OSWorld contains a complete screenshot of the desktop screen, including the mouse's position and shape, various application windows, files, and folders that are opened in different sizes and orders, maintaining the same perception as a human. 
Also, to be aligned with previous agent-building web and mobile research [30, 27, 9, 66] that provide and support the use of the webpage's DOM and app's view hierarchy, OSWorld also provides XML-format accessibility (a11y) tree (obtained via ATSPI 2 on Ubuntu, via PyWinAuto on Windows, etc.), which can support additional information for modeling. These raw observations allow rich interactions between multiple applications but induce challenges in long-horizon decision-making from high-"},{"citing_arxiv_id":"2401.10935","ref_index":88,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05459","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","primary_cat":"cs.HC","submitted_at":"2024-01-10T09:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Another challenge of RL-based task automation is the huge action space and the sparse reward. 
A typical GUI-grounded task usually involves 5-10 steps, each of which contains 10-100 candidate actions, leading to a search space size of 10^5 to 100^10. The task is completed only if the correct sequence of actions is taken. In order to tackle such challenge, many frameworks have been proposed. Liu et al. [6] introduced the method to use high-level \"workflows\" to constrain the allowable actions at each time step. The workflows can prune out bad exploration directions, accelerating the agent's ability to discover rewards. Gur et al. [40] decomposed the complicated instruction into multiple smaller ones, and schedule a curriculum for the agents to gradually manage to follow an increasing number of sub-instructions."}],"limit":50,"offset":0}