{"total":17,"items":[{"citing_arxiv_id":"2605.22219","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval","primary_cat":"cs.AI","submitted_at":"2026-05-21T09:22:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22154","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-21T08:25:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19260","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees","primary_cat":"cs.AI","submitted_at":"2026-05-19T02:13:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18652","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:57:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18597","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Latent Action Reparameterization for Efficient Agent Inference","primary_cat":"cs.AI","submitted_at":"2026-05-18T16:07:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LAR learns a compact latent action space from trajectories that shortens the effective decision horizon for LLM agents, reducing token count and inference time while preserving task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14290","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Web Agents Should Adopt the Plan-Then-Execute Paradigm","primary_cat":"cs.CR","submitted_at":"2026-05-14T02:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM 
calls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12501","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"bbox\": [224, 179, 432, 253 ], \"center\": [328, 216], } ] Text Screenshot Annotations [ { \"start text\": \"Unlike the main Philippine…\", \"end text\": \"…are concentrated in urban areas\", \"starting\": [924, 341], \"ending\": [1083, 370] }, …… \"cursor\": [38, 1216], } ] Table Screenshot Annotations [ { \"content\": \"31\", \"col head\": \"MODELLING ATTEMPTED\", \"col id\": 3, \"row head\": \"SL2S\", \"row id\": 3, \"bbox\": [623, 284, 898, 367] }, …… { \"content\": \"77\", …… ] Canvas Screenshot"},{"citing_arxiv_id":"2605.12004","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning Agentic Policy from Action Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-12T11:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2602.14234, 2026. 
[9] Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, and Zhiwu Lu. Harder is better: Boosting mathematical reasoning via difficulty-aware GRPO and multi-aspect question reformulation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=nfURupkdRJ. [10] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091-28114, 2023. [11] Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Openvlthinker: Complex vision-language reasoning via iterative sft-rl cycles."},{"citing_arxiv_id":"2605.07110","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability","primary_cat":"cs.CL","submitted_at":"2026-05-08T01:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"use is therefore a systems problem as much as a modeling problem, because perception, planning, execution authority, memory, tool use, and oversight interact under live software conditions. The recent expansion of CUA deployment settings makes an integrative survey timely. Benchmarks have moved from bounded website tasks toward visually grounded, enterprise, personalized, and open-environment settings [8]-[17]. 
At the same time, system-building and evaluation work has diversified across grounding, memory, long-horizon planning, tool-augmented execution, safety evaluation, and open-deployment stacks [1], [2], [18]-[20]. The difficulty is no longer only the lack of evidence about CUA capability. It is also the lack of a common coordinate system for interpreting how capability is"},{"citing_arxiv_id":"2605.06992","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Why Does Agentic Safety Fail to Generalize Across Tasks?","primary_cat":"cs.LG","submitted_at":"2026-05-07T22:16:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Advances in Neural Information Processing Systems (NeurIPS), 1993. [29] Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems, 37:82895-82920, 2024. [30] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091-28114, 2023. [31] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. 
Journal of artificial intelligence research, 13:227-303, 2000."},{"citing_arxiv_id":"2605.06761","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Weblica: Scalable and Reproducible Training Environments for Visual Web Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:17:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"cached HTTP traffic to create reproducible offline approximations of websites, though limited to simple mini-tasks. WebArena [50] and VisualWebArena [18] evaluate agents on self-hosted websites with programmatic success checking. While reproducible, they suffer from a sim-to-real gap. Benchmarks on real websites include GAIA [22], WebVoyager [11], and Mind2Web [7], which test agents on live web tasks but face reproducibility challenges as websites change over time. WebVoyager additionally suffers from limited task diversity, with up to 51% of tasks solvable via search shortcuts. 
Online-Mind2Web [41] addresses these issues with a more realistic setup that evaluates agents on live websites using an LLM-as-Judge for task success."},{"citing_arxiv_id":"2604.23781","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents","primary_cat":"cs.CV","submitted_at":"2026-04-26T16:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14448","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MARCA: A Checklist-Based Benchmark for Multilingual Web Search","primary_cat":"cs.CL","submitted_at":"2026-04-15T21:54:27+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13531","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management","primary_cat":"cs.AI","submitted_at":"2026-04-15T06:27:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between 
generalist and specialized models plus RL gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06367","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks","primary_cat":"cs.CR","submitted_at":"2026-04-07T18:43:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12538","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Agentic Reasoning for Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-01-18T18:58:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"constraints, and feedback loops, motivating diverse system designs [43, 44] that integrate planning, tool use, search, reflection, memory mechanisms, and multi-agent coordination. On the other hand, the benchmark landscape has emerged to evaluate agentic reasoning, ranging from targeted tests that isolate individual agentic capabilities to application-specific benchmarks that assess end-to-end behavior in domain-specific environments and scenarios [45, 46, 47, 48, 20, 21, 49, 50]. Together, this survey synthesizes agentic reasoning methods into a unified roadmap that bridges reasoning and acting. 
We systematically characterize these methods across the complementary scopes of foundational, self-evolving, and collective reasoning, while distinguishing between in-context and post-training optimization modes. We further contextualize this roadmap through representative applications and evaluation"},{"citing_arxiv_id":"2502.12110","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A-MEM: Agentic Memory for LLM Agents","primary_cat":"cs.CL","submitted_at":"2025-02-17T18:36:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and extrinsic evaluation measures for machine translation and/or summarization, pages 65-72, 2005. [6] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206-2240. PMLR, 2022. [7] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091-28114, 2023. [8] Khant Dev and Singh Taranjeet. mem0: The memory layer for ai agents. https://github.com/mem0ai/mem0, 2024. [9] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven"}],"limit":50,"offset":0}