{"total":19,"items":[{"citing_arxiv_id":"2605.22564","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:45:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20833","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MemGym: a Long-Horizon Memory Environment for LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-20T07:25:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19260","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees","primary_cat":"cs.AI","submitted_at":"2026-05-19T02:13:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18652","ref_index":69,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:57:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18535","ref_index":66,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Scaling: Agents Are Heading to the Edge","primary_cat":"cs.LG","submitted_at":"2026-05-18T15:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17046","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"1GC-7RC: One Graphic Card -- Seven Research Challenges! 
How Good Are AI Agents at Doing Your Job?","primary_cat":"cs.LG","submitted_at":"2026-05-16T15:35:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14859","ref_index":55,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Do Coding Agents Understand Least-Privilege Authorization?","primary_cat":"cs.CR","submitted_at":"2026-05-14T14:05:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13193","ref_index":46,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition","primary_cat":"cs.CV","submitted_at":"2026-05-13T08:49:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FIKA-Bench is a leakage-aware benchmark of 311 instances showing that even the best large multimodal models and tool-equipped agents reach only 25.1% accuracy on fine-grained recognition questions that require external evidence search and 
verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12501","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"five modalities: GUI, Text, Table, Canvas, and Natural Image, and includes not only clicking, but also dragging and drawing actions, such as tracing object boundaries in Photoshop for image cutout. We find that performance on CUActSpot differs substantially from conventional GUI grounding benchmarks [14-16, 24], while showing closer agreement with end-to-end agentic results such as OSWorld [17]. This suggests CUActSpot may better reflect real-world computer-use scenarios. We further propose a data synthesis pipeline that obtains screenshots and coordinate-related metadata through code-based rendering, and we find that advanced GPT models can be leveraged to synthesize data for complex operations. 
Using this approach, we generate 50M samples that can support model"},{"citing_arxiv_id":"2605.12481","ref_index":45,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents","primary_cat":"cs.AI","submitted_at":"2026-05-12T17:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"In real computer-use environments, usable tools are difficult to obtain and maintain. Specifically, APIs are often application-specific, incomplete, or unstable, and collecting GUI-Tool data requires expensive environment instrumentation. Existing efforts [49, 54] partly Table 1: Performance comparison between pure GUI and hybrid GUI-Tool action spaces on OSWorld [45]. \"Steps\" is the average number of trajectory steps; \"Tool-calls\" is the average number of tool calls. See details in Appendix C.1. 
Model Action Accuracy↑ Steps↓ Tool-calls Qwen3VL-8B GUI 29.0 19.2 - + Tools 28.2(−0.8) 19.3 0.003 Qwen3VL-235B GUI 41.1 25.9 - + Tools 38.1(−2.0) 17.4 6.10 EvoCUA-32B GUI 52.6 25.0 - + Tools 40.5(−12.0) 26.1 7.49 Claude-4-sonnet GUI 47."},{"citing_arxiv_id":"2605.11882","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[40] Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, and Timothy Baldwin. Learning from failure: Integrating negative examples when fine-tuning large language models as agents. arXiv preprint arXiv:2402.11651, 2024. [41] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079-80110, 2023. [42] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040-52094, 2024. 
[43] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu,"},{"citing_arxiv_id":"2605.09423","ref_index":92,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T08:51:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"[91] Xiaokang Ye, Jiawei Ren, Yan Zhuang, Xuhong He, Yiming Liang, Yiqing Yang, Mrinaal Dogra, Xianrui Zhong, Eric Liu, Kevin Benavente, Rajiv Mandya Nagaraju, Dhruv Sharma, Ziqiao Ma, Tianmin Shu, Zhiting Hu, and Lianhui Qin. Simworld: An open-ended simulator for agents in physical and social worlds. In Advances in Neural Information Processing Systems, 2025. [92] Xunjian Yin, Xinyi Wang, Liangming Pan, Li Lin, Xiaojun Wan, and William Yang Wang. Gödel agent: A self-referential agent framework for recursive self-improvement. arXiv preprint arXiv:2410.04444, 2024. URL https://arxiv.org/abs/2410.04444. [93] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, and Jiajun Wu. 
Wonderworld: Interactive 3d scene generation from a single image, 2025."},{"citing_arxiv_id":"2605.07110","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability","primary_cat":"cs.CL","submitted_at":"2026-05-08T01:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"problem, because perception, planning, execution authority, memory, tool use, and oversight interact under live software conditions. The recent expansion of CUA deployment settings makes an integrative survey timely. Benchmarks have moved from bounded website tasks toward visually grounded, enterprise, personalized, and open-environment settings [8]-[17]. At the same time, system-building and evaluation work has diversified across grounding, memory, long-horizon planning, tool-augmented execution, safety evaluation, and open-deployment stacks [1], [2], [18]-[20]. The difficulty is no longer only the lack of evidence about CUA capability. 
It is also the lack of a common coordinate system for interpreting how capability is"},{"citing_arxiv_id":"2605.05701","ref_index":62,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Inference-Time Budget Control for LLM Search Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:45:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04808","ref_index":83,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents","primary_cat":"cs.AI","submitted_at":"2026-05-06T11:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DTap is a new red-teaming platform for AI agents that uses autonomous exploration across realistic simulations to discover vulnerabilities and creates a verifiable benchmark dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Forty-second International Conference on Machine Learning. [82] Hongyang Yang, Boyu Zhang, Neng Wang, Cheng Guo, Xiaoli Zhang, Likun Lin, Junlin Wang, Tianyu Zhou, Mao Guan, Runjia Zhang, and Christina Dan Wang. Finrobot: An open-source ai agent platform for financial applications using large language models. arXiv preprint arXiv:2405.14767, 2024. [83] Zendesk, Inc. Zendesk user content and conduct policy, 2024. [84] Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 
Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. In The Thirteenth International Conference on Learning Representations."},{"citing_arxiv_id":"2604.27488","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO","primary_cat":"cs.CL","submitted_at":"2026-04-30T06:39:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23781","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents","primary_cat":"cs.CV","submitted_at":"2026-04-26T16:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13531","ref_index":52,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management","primary_cat":"cs.AI","submitted_at":"2026-04-15T06:27:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RiskWebWorld is the first realistic interactive benchmark for GUI agents in 
e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23883","ref_index":236,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges","primary_cat":"cs.AI","submitted_at":"2025-10-27T21:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"In particular, this is crucial for security evaluations: a system that is safe in expectation but occasionally executes a destructive action is unacceptable for any mission-critical scenarios. On the other hand, for tasks such as code generation, obtaining one correct solution (out of many possible generations) is feasible (and quantified as the pass@k metric [236]). Thus, it is imperative that evaluations and benchmarks for the security of agentic AI frameworks move towards reporting performance via distributions (e.g., pass∧1, pass∧k for several k) rather than one single average. Standardizing judges and reducing judge bias. LLM-as-a-judge [237] is attractive for scale but can be biased by prompt design, trace format, or model choice."}],"limit":50,"offset":0}