{"total":23,"items":[{"citing_arxiv_id":"2605.23904","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkillOpt: Executive Strategy for Self-Evolving Agent Skills","primary_cat":"cs.AI","submitted_at":"2026-05-22T17:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillOpt introduces a controllable text-space optimizer that evolves agent skills via add/delete/replace edits accepted only on strict held-out validation improvement, reporting consistent gains across 52 model-benchmark-harness combinations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22564","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:45:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22505","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Towards Direct Evaluation of Harness Optimizers via Priority Ranking","primary_cat":"cs.AI","submitted_at":"2026-05-21T13:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 
scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22875","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RMA: an Agentic System for Research-Level Mathematical Problems","primary_cat":"cs.AI","submitted_at":"2026-05-20T04:54:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20315","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:50:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19035","ref_index":69,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On","primary_cat":"cs.AI","submitted_at":"2026-05-18T18:57:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Argues that trustworthiness in Agent-to-Agent networks requires a new conceptual framework with four design pillars baked in from the 
beginning, as retrofitting existing single-agent methods is insufficient.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":57,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"concern is not how much history to retain, but which pieces of information are most useful for the next action under a limited context budget. In code agents, working memory often appears as structured prompt regions, state summaries, failed-test records, file lists, or critical stack information. Its purpose is to mitigate context explosion, reduce repeated localization, and preserve the local consistency of an ongoing repair or editing trajectory [57, 182, 183, 45]. From a harness perspective, working memory is the active control surface between the model and the code environment: it determines what the agent observes before choosing the next tool call, edit, or verification step. 
Representative systems such as SWE-agent [57] and RepairAgent [183] show that, even with the same underlying model, repository-level repair performance can vary substantially"},{"citing_arxiv_id":"2605.16819","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents","primary_cat":"cs.CL","submitted_at":"2026-05-16T05:25:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentKernelArena is a new open benchmark that measures complete AI agent workflows on 196 GPU kernel tasks with correctness, performance, and generalization checks to unseen configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13438","ref_index":65,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CogniFold: Always-On Proactive Memory via Cognitive Folding","primary_cat":"cs.AI","submitted_at":"2026-05-13T12:34:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12913","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Revisiting DAgger in the Era of LLM-Agents","primary_cat":"cs.LG","submitted_at":"2026-05-13T02:40:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, 
beating several larger published systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12501","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"direction for future work. References [1] Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. Technical report, Anthropic, October 2024. URL https://www.anthropic.com/news/ 3-5-models-and-computer-use. [2] OpenAI. Computer-Using Agent. Technical report, OpenAI, January 2025. URL https: //openai.com/index/computer-using-agent/. [3] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528-50652, 2024. 
[4] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al."},{"citing_arxiv_id":"2605.09860","ref_index":53,"ref_count":3,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T01:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sequence of that depth; and the constraint is that no more than K such decisions are made per episode. Most existing systems simplify this problem by fixing h as a single hand-tuned scalar per task. Such as, action chunking in robot learning [ 55, 9], modern vision-language-action models [25, 22], step-bounded reasoning in coding, web, and embodied VLM systems [ 53, 51, 26]. This is the suboptimality we argue against. In this work, we empirically show that on Sliding Puzzle and Sokoban-two long-horizon visual reasoning tasks, the optimal h should be state-, task-, and budget-dependent, and a scalar cannot track a target that varies across states. 
We propose a singlemodel-native, unifiedpolicy: one VLM with two heads-a depth head over H and"},{"citing_arxiv_id":"2605.09423","ref_index":93,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T08:51:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Simworld: An open-ended simulator for agents in physical and social worlds. InAdvances in Neural Information Processing Systems, 2025. [92] Xunjian Yin, Xinyi Wang, Liangming Pan, Li Lin, Xiaojun Wan, and William Yang Wang. Gödel agent: A self-referential agent framework for recursive self-improvement.arXiv preprint arXiv:2410.04444, 2024. URLhttps://arxiv.org/abs/2410.04444. [93] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image, 2025. URL https: //arxiv.org/abs/2406.09394. [94] Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, and Mohit Bansal. 
Envgen: Generating and adapting environments via llms for training embodied agents, 2024."},{"citing_arxiv_id":"2605.07725","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"learning from human feedback (RLHF) [ 42, 43] with PPO [ 44] to more scalable methods like GRPO [20]. For language agents, structured reasoning paradigms such as ReAct [3], Toolformer [4], and FireAct [45] enable tool use but rely on demonstrations rather than online optimization. Recent work extends RL to agent interaction trajectories across code generation [ 46], tool use [47], GUI interaction [48], and web navigation [ 49]. A central challenge is credit assignment under sparse, delayed feedback, addressed via trajectory-level updates and value-free formulations [50, 51]. 
KL- regularized policy optimization further introduces bias and instability concerns [ 44], amplified in agentic settings by distribution shift and compounding errors."},{"citing_arxiv_id":"2605.06992","ref_index":118,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Why Does Agentic Safety Fail to Generalize Across Tasks?","primary_cat":"cs.LG","submitted_at":"2026-05-07T22:16:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability.arXiv preprint arXiv:2209.08025, 2022. [117] Siyuan Xu and Minghui Zhu. Efficient safe meta-reinforcement learning: Provable near-optimality and anytime safety. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [118] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528-50652, 2024. [119] Tsung-Yen Yang, Michael Y Hu, Yinlam Chow, Peter J Ramadge, and Karthik Narasimhan. 
Safe"},{"citing_arxiv_id":"2605.06161","ref_index":48,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05701","ref_index":65,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Inference-Time Budget Control for LLM Search Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:45:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23781","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents","primary_cat":"cs.CV","submitted_at":"2026-04-26T16:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task 
success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05550","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery","primary_cat":"cs.CL","submitted_at":"2026-04-07T07:52:01+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03088","ref_index":62,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses","primary_cat":"cs.SE","submitted_at":"2026-04-03T15:11:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while cutting token use by up to 40% and delivering up to 3.2x speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09725","ref_index":72,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Efficient Remote KV Cache Reuse with GPU-native Video Codec","primary_cat":"cs.DC","submitted_at":"2026-02-10T12:29:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving 
accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04905","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches","primary_cat":"cs.SE","submitted_at":"2025-10-06T15:20:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper organizes repository-level retrieval-augmented code generation into a unified framework covering retrieval substrate, control regime, and evaluation setting while summarizing strategies, datasets, and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Beyond scalability, retrieval-based techniques improve explainability, controllability, and interpretability by surfacing human-readable artifacts during the generation process. As the field advances, diverse RAG strategies have emerged, including sparse and dense retrieval, graph-based retrieval, hybrid pipelines, and agent-style retrieval that integrates static code analysis, tool invocation, and iterative refinement [19][20][21][22]. In recent literature, this research direction is also referred to as Retrieval-Augmented Code Generation (RACG), highlighting its growing importance at the intersection of software engineering and LLM research. 
As illustrated in Figure 1, existing surveys on LLMs generally proceed from either the perspective of general code"},{"citing_arxiv_id":"2510.03843","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Smart Paste: Automatically Fixing Copy/Paste for Google Developers","primary_cat":"cs.SE","submitted_at":"2025-10-04T15:43:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Smart Paste applies deep learning to predict and suggest post-paste code edits in Google's IDE, achieving 45% acceptance and contributing over 1% of all code written company-wide after deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}