{"total":44,"items":[{"citing_arxiv_id":"2605.22505","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Towards Direct Evaluation of Harness Optimizers via Priority Ranking","primary_cat":"cs.AI","submitted_at":"2026-05-21T13:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22177","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles","primary_cat":"cs.LG","submitted_at":"2026-05-21T08:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20833","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MemGym: a Long-Horizon Memory Environment for LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-20T07:25:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding 
evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20616","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents","primary_cat":"cs.CL","submitted_at":"2026-05-20T02:03:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Auto-Dreamer trains an offline memory consolidator via GRPO on agent performance to abstract cross-session patterns, outperforming baselines by 7 points on ScienceWorld with 12x smaller memory and generalizing to ALFWorld and WebArena.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20548","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-05-19T22:51:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Systematic study of inter-agent communication in LLM multi-agent systems shows reasoning and verification are critical for performance, with a new augmentation technique recovering 86.2% of failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":247,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with 
applications across domains and listed open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"inside a controlled environment, and verifies the resulting state before the next transition. The growing engineering ecosystem around agent harnesses reinforces this view: recent curated resources distinguish orchestration, working state, execution substrates, evaluation harnesses, observability, and governance as separable harness layers rather than incidental implementation details [247, 248, 249, 25]. In this view, the harness acts as a cybernetic governor: a control layer that observes the effects of agent actions and regulates subsequent state transitions. Rather than merely forwarding error messages to the model, it observes the repository and execution environment through deterministic sensors such as linters, parsers, compilers, type checkers, unit tests, integration tests, static analyzers, fuzzers, runtime monitors, and CI"},{"citing_arxiv_id":"2605.18729","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction","primary_cat":"cs.RO","submitted_at":"2026-05-18T17:52:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Robo-Cortex proposes a self-evolving embodied navigation agent using dual-grain cognitive memory and autonomous knowledge induction from trajectories, reporting SPL gains on IGNav, AR, AEQA and preliminary real-robot tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18652","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI 
Agents","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:57:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18535","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Scaling: Agents Are Heading to the Edge","primary_cat":"cs.LG","submitted_at":"2026-05-18T15:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18401","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution","primary_cat":"cs.CL","submitted_at":"2026-05-18T13:44:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"They also make experience reuse a first-order systems problem: each task yields operational evidence, but that evidence is distributed across low-level traces and must be selected before it can 
support future tasks. Prior work on experiential agents shows that such traces can be organized into reusable experience or skills that shape later behavior [51, 58, 75]. Raw trajectories, however, are a weak substrate for long-term experience reuse. They are lengthy, noisy, tightly bound to local environments, and often conflate robust strategies with incidental state. Agent Skills provide a more structured schema for distilled experience: a skill can package procedural instructions, scripts, templates, references, dependency boundaries, and applicability conditions in a single artifact."},{"citing_arxiv_id":"2605.17075","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems","primary_cat":"cs.CR","submitted_at":"2026-05-16T16:46:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hybrid LLM-RL red teaming framework generates adaptive attack campaigns in simulated enterprise networks to evaluate the robustness of AI-enabled SOAR systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15710","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory","primary_cat":"cs.CL","submitted_at":"2026-05-15T08:00:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous 
sources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15461","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DrugSAGE:Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery","primary_cat":"cs.LG","submitted_at":"2026-05-14T22:49:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14563","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation","primary_cat":"cs.SE","submitted_at":"2026-05-14T08:35:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022. [27] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634-8652, 2023. 
[28] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations, 2023. [29] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize"},{"citing_arxiv_id":"2605.13438","ref_index":52,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CogniFold: Always-On Proactive Memory via Cognitive Folding","primary_cat":"cs.AI","submitted_at":"2026-05-13T12:34:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13369","ref_index":25,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Query-Conditioned Test-Time Self-Training for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-13T11:27:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13037","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-13T05:46:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance 
gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12755","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"State-Centric Decision Process","primary_cat":"cs.AI","submitted_at":"2026-05-12T21:09:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Without a state space, sequential decision-making has no surface to operate on. Existing language agents address parts of this gap but none closes it. Reactive agents [55, 34] interleave reasoning with action selection yet operate directly on raw observations without constructing an explicit state. Reflective agents go further, accumulating verbal lessons or causal memories across episodes [35, 26, 60], but these summaries are open-ended text rather than states linked by certified transitions. 
Action planners [45, 54, 62] deliberate over candidate action sequences before or during execution, gaining the benefit of lookahead, but the plan entries are things to do rather than conditions to verify, so progress cannot be checked against the environment."},{"citing_arxiv_id":"2605.12741","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11882","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"existing agent refinement and test-time defense baselines using the same backbone, Qwen3-8B-Instruct. Base denotes the standard native tool-calling agent without additional refinement. ReAct instantiates Qwen3-8B-Instruct with an explicit reasoning-action-observation prompting format [44]. 
Reflexion further augments ReAct with verbal reflections and episodic memory from previous failures [38]. Tool Filter and PI Detector are AgentDojo-style runtime defenses against indirect prompt injection, applied on top of the same base agent [9, 12, 24]. Unlike these inference-time or runtime defenses, FATE updates the policy by converting verifier-scored failure trajectories into Pareto-filtered repair supervision. Appendix H reports diagnostic training baselines beyond the"},{"citing_arxiv_id":"2605.15215","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10366","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T11:09:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reasoning methods [3, 19] can produce plausible solutions on small graph instances, but they are brittle when exact structural recovery and verifiable 
outputs are required.2 Tool-using agents [21, 14, 12, 25, 26] improve execution, yet they usually assume a fixed toolbox and therefore cannot accumulate new graph-specific algorithmic tools during training. 3 Reflection-based self-improvement methods [16, 24] can refine language behavior through feedback, but they typically treat failure as a single undifferentiated signal. In graph reasoning, such undifferentiated feedback is inadequate. A failure caused by an incorrectly reconstructed constraint should improve the parsing behavior; a failure caused by choosing an unsuitable solver should improve retrieval or selection; and a failure caused"},{"citing_arxiv_id":"2605.18799","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:22:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10064","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs","primary_cat":"cs.AI","submitted_at":"2026-05-11T06:39:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse 
benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to open-ended agents. VOYAGER[ 17] grows a code-based skill library, SEAGENT[ 16] learns software use through an auto-generated curriculum, and proposer-solver systems such as AGENT0 and ABSOLUTEZEROconstruct tasks through self-play or adversarial generation. In parallel, prompting methods such as self-consistency [ 18], ReAct [ 28], and Reflexion [ 15] improve inference-time behavior through sampling, tool use, or verbal self-reflection. These lines highlight the importance of deciding what experience to expose to the model, but their prompts, memories, or curricula are usually fixed, locally retrieved, or generated without an explicit coverage guarantee over task types. Position of MAGE.MAGEcombines these threads by treating the agent's persistent state as a"},{"citing_arxiv_id":"2605.10057","ref_index":23,"ref_count":3,"confidence":0.55,"is_internal_anchor":false,"paper_title":"STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T06:34:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STAR presents a failure-aware routing framework using a state-conditioned transition policy and an agent routing matrix combining expert routes with learned recoveries from execution traces to improve multi-agent spatiotemporal reasoning.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"STB24) [9] (11 task types), and ST-Bench (2026) (henceforthSTB26) [15] (4 task types). We use eight backbone LLMs spanning proprietary and open-source families from 3B to 20B parameters. The primary metric is exact match (EM) with numeric tolerance, while regression tasks are evaluated with RMSE/MAE. 
Baselines include LLM-only, LLM-only with extended reasoning, Reflexion [23], ReAct [31], Tree-of-Thought [30], Graph-of-Thought [2, 32], and function-calling. Unless otherwise noted, the routing matrix uses α= 0.3 and τ= 0.4 . Full implementation details, prompts, and per-task results are deferred to Appendices D-F. 4.2 Main Results Table 1: STARvs. prompting and agent baselines (Qwen3-8B, EM-eligible queries; 95% Wilson"},{"citing_arxiv_id":"2605.09879","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T02:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, reasoning gains do not always generalize beyond the training domain [ 2, 11]. Although mathematical reasoning is widely viewed as a general source of reasoning capability, math-specialized models often bring limited benefits to broader tasks such as scientific QA, coding, and agent planning [11, 2]. Recent work therefore explores broader reasoning data and multi-domain RL objectives [30, 22, 25, 27, 33], but the resulting generalization remains sensitive to the target behavior. This limitation is especially pronounced for agentic reasoning. 
Prior studies show that math reasoning does not consistently improve broader LLM capabilities [ 11]; multi-task RL can improve several reasoning-intensive domains but does not reliably enhance agent performance [33];"},{"citing_arxiv_id":"2605.09330","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory","primary_cat":"cs.LG","submitted_at":"2026-05-10T05:04:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[35] Bernhard Schölkopf. Causality for machine learning. InProbabilistic and causal inference: The works of Judea Pearl, pages 765-804. 2022. [36] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning.Proceedings of the IEEE, 109(5):612-634, 2021. [37] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634-8652, 2023. [38] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 
Alfworld: Aligning text and embodied environments for interactive"},{"citing_arxiv_id":"2605.09278","ref_index":56,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"While shared memory boosts long-horizon reasoning, it also opens a critical vulnerability: a corrupted memory state, which can subsequently contaminate all downstream memory-augmented reasoning [66]. The corruption arises because individual agents are inherently imperfect: LLMs hallucinate, agree sycophantically [ 6, 7, 55], or confidently assert incorrect claims [56, 86, 89]. These failures do not cancel under debate. 
Across recent works, three failure patterns present (Figure 1): (i) an over-confident contributor pushes a hallucinated entry past hedged auditors who defer to its confidence rather than challenge it [17, 72, 82], producing a corrupted memory that reads like established fact; (ii) over-confident auditors veto a tentative but correct contribution [14, 93], producing an"},{"citing_arxiv_id":"2605.10990","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries","primary_cat":"cs.SE","submitted_at":"2026-05-09T11:41:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"ent lifecycle questions: skill acquisition and usefulness, pre-load security auditing, or governance. SKILLGUARD studies post-deployment maintenance: whether a previously valid skill has become stale because its environment-facing assumptions no longer hold. Reactive repair and localization. Automated program repair localizes faults and patches code after failures are observed [4, 5, 24]. Self-refine [20] and Reflexion [25] similarly improve model outputs through iterative feedback. These methods are reactive: they require a failed execution, test, or trajectory before repair is attempted. SKILLGUARD instead targets proactive maintenance. 
It detects violated environment contracts before relying on an execution failure, and it uses the failed contract as a localized repair signal."},{"citing_arxiv_id":"2605.08715","ref_index":50,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:55:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"[48] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634-8652, 2023. [49] Yoo Yeon Sung, Hannah Kim, and Dan Zhang. Verila: A human-centered evaluation framework for interpretable verification of llm agent failures.arXiv preprint arXiv:2503.12651, 2025. [50] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426-9439, 2024. 
[51] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi"},{"citing_arxiv_id":"2605.08704","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-09T05:38:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmark transfer.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We compare AgentPSO with baselines from three main categories. The implementation details of all baselines are described in Appendix C: • Vanilla single-agent methods. We include Chain-of-Thought Prompting (CoT) [41] and Step-Back Prompting [50] as basic single-agent reasoning baselines. • Advanced single-agent methods. We include Self-Refine [30], Reflection [33], Self-Consistency [39], and Tree-of-Thoughts [46], which improve single-agent reasoning through refinement, reflection, sampling, or structured search. • Multi-agent methods. We compare against MAD [10], MoA [38], and DMAD [28], which leverage multiple agents to improve reasoning through debate, aggregation, or diversity. 
Implementation Details.We implement AgentPSO with ChatGPT (gpt-5."},{"citing_arxiv_id":"2605.08703","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RewardHarness: Self-Evolving Agentic Post-Training","primary_cat":"cs.AI","submitted_at":"2026-05-09T05:32:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2603.12698, 2026. [22] Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, et al. Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks.arXiv preprint arXiv:2603.27862, 2026. [23] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634-8652, 2023. [24] Theodore Sumers, Shunyu Yao, Karthik R Narasimhan, and Thomas L Griffiths. 
Cognitive architectures for language agents. Transactions on Machine Learning Research, 2023."},{"citing_arxiv_id":"2605.07594","ref_index":27,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents","primary_cat":"cs.RO","submitted_at":"2026-05-08T11:07:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"12s, as the executor only processes compact, state-relevant compiled memory. 2. Related Work General Agent Memory Systems. Most general-purpose agent memory systems assume a static, up-front memory injection paradigm, where retrieved or summarized content is provided once and remains unchanged throughout execution. Foundational systems such as Generative Agents [25, 26, 3], Reflexion [27], MemGPT [28], and ExpeL [29] store and retrieve experience in natural language, injecting it into the input context during execution. 
Subsequent work explores more structured memory organization, including knowledge graph indexing as in HippoRAG [30], LLM-driven memory management as exemplified by Mem0 [13], Zettelkasten-style [31, 32] linking proposed in A-Mem [33], and agent-controlled hot-path updates introduced in"},{"citing_arxiv_id":"2605.07461","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-08T09:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. [12] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279-1297, 2025. [13] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634-8652, 2023. [14] Encheng Su, Jianyu Wu, Chen Tang, Lintao Wang, Pengze Li, Aoran Wang, Jinouwen Zhang, Yizhou Wang, Yuan Meng, Xinzhu Ma, et al. 
Sciif: Benchmarking scientific instruction following towards rigorous"},{"citing_arxiv_id":"2605.05701","ref_index":45,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Inference-Time Budget Control for LLM Search Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:45:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05413","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From History to State: Constant-Context Skill Learning for LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-06T20:13:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27488","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO","primary_cat":"cs.CL","submitted_at":"2026-04-30T06:39:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called 
Skill-X.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26805","ref_index":41,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations","primary_cat":"cs.AI","submitted_at":"2026-04-29T15:35:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production deployment.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Advances in neural information processing systems, 36:8634-8652, 2023. [40] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1-22, 2023. [41] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025. [42] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 
Self-refine: Iterative refinement with self-feedback."},{"citing_arxiv_id":"2604.18292","ref_index":86,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-20T14:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14475","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve","primary_cat":"cs.AI","submitted_at":"2026-04-15T23:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14399","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing","primary_cat":"cs.RO","submitted_at":"2026-04-15T20:27:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SpaceMind is a self-evolving modular VLM agent framework that achieves 90-100% navigation success in nominal conditions and recovers from failures via experience 
distillation, with zero-code transfer to physical robots for on-orbit tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10800","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis","primary_cat":"cs.SE","submitted_at":"2026-04-12T20:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vulnerabilities at 12% failure rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08206","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"\"Theater of Mind\" for LLMs: A Cognitive Architecture Based on Global Workspace Theory","primary_cat":"cs.MA","submitted_at":"2026-04-09T13:06:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Global Workspace Agents (GWA) is proposed as an active, event-driven cognitive architecture for LLMs featuring an entropy-based intrinsic drive and dual-layer memory to enable sustained self-directed agency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02674","ref_index":52,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Do Agent Societies Develop Intellectual Elites? 
The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-04-03T03:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174-15186, 2024. [51] Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, et al. Scaling large language model-based multi-agent collabora- tion.arXiv preprint arXiv:2406.07155, 2024. [52] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634-8652, 2023. [53] Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions.The annals of mathematical statistics, 19(2):279-281, 1948."}],"limit":50,"offset":0}