{"total":14,"items":[{"citing_arxiv_id":"2606.30306","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Always-On Agents: A Survey of Persistent Memory, State, and Governance in LLM Agents","primary_cat":"cs.MA","submitted_at":"2026-06-29T13:47:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Survey mapping persistent state in LLM agents along six axes and proposing the AOEP-v0 protocol to evaluate governance and recovery obligations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02461","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents","primary_cat":"cs.AI","submitted_at":"2026-06-01T16:32:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AgentCL constructs controlled task streams with intentional reusability and introduces MemProbe to evaluate non-parametric memory designs for continual learning in language agents across coding, research, and reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07632","ref_index":190,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment","primary_cat":"cs.LG","submitted_at":"2026-05-31T05:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper calls for life cycle assessment to capture embodied hardware costs and full pipeline operational costs in AI development and 
deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27366","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation","primary_cat":"cs.AI","submitted_at":"2026-05-26T17:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MUSE-Autoskill introduces a skill-centric framework for self-evolving LLM agents through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20625","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents","primary_cat":"cs.AI","submitted_at":"2026-05-26T15:48:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AlphaMemo equips LLM alpha-mining agents with AST-diff motif memory, residual learning, and asymmetric veto control to improve out-of-sample factor discovery on CSI 500 and S&P 500.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21463","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mem-$\\pi$: Adaptive Memory through Learning When and What to Generate","primary_cat":"cs.CL","submitted_at":"2026-05-20T17:51:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% 
on web navigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18565","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-18T15:43:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MINTEval benchmark shows current memory-augmented systems average 27.9% accuracy on long-horizon interference tasks, limited by retrieval and memory construction with degradation from intervening updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08013","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning CLI Agents with Structured Action Credit under Selective Observation","primary_cat":"cs.AI","submitted_at":"2026-05-08T17:02:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"labels are produced by independent Claude Opus 4.7 audits [4] with live filesystem interaction and aggregated by majority vote. Public-corpus instances repeatedly judged hallucinatory or incorrectly labeled are excluded. 4.2 Experimental Configuration All experiments are conducted on four NVIDIA H200 accelerators with Qwen3-14B [41] as the policy model and SGLang [76] for both training rollouts and inference. 
Training uses group size 4, learning rate 5×10^-7, train and validation batch sizes of 16, mini-batch size 16, and a maximum context length of 32,768. The environment horizon is 6 steps, sandbox execution times out after 10 seconds, and the reward combines answer reward and progress reward with weights 3 and 0."},{"citing_arxiv_id":"2605.06365","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","primary_cat":"cs.AI","submitted_at":"2026-05-07T14:39:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tity, or partial downstream invalidation the organizing abstraction of the whole system. 2.7 Benchmarks and Reproducible Agent Environments Another relevant literature studies how to evaluate agents in realistic yet reproducible environments. Benchmarks such as AgentBench [38], WebArena [39], VisualWebArena [40], WorkArena [41], AndroidWorld [42], OSWorld [43], AppWorld [44], GAIA [45], and LifelongAgentBench [46] increasingly evaluate agents in stateful, tool-using settings rather than static prompts. Recent 2025-2026 benchmark work sharpens the memory and persistence angle in particular. 
MemoryAgentBench [47], Evo-Memory [48], and Mem2ActBench [49] all argue that static one-shot evaluation misses the harder problem of incremental accumulation, selective forgetting, and task-conditioned reuse."},{"citing_arxiv_id":"2604.17091","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)","primary_cat":"cs.CL","submitted_at":"2026-04-18T17:59:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"knowledge, such as user preference, tool behaviors and effective action patterns, is not available at the outset. It emerges only through repeated trial and failure during actual task execution. This exploration is a natural and necessary process. The key question is whether the lessons learned can be retained and reused when similar tasks arise later. Without such a mechanism, agents repeat the same failure patterns across sessions [9, 14]. Successful strategies, once discovered, are forgotten upon context expiration. Token expenditure scales linearly with task count, yet effective capability remains flat-a stagnation loop with no return on accumulated interaction. Existing agent frameworks largely fail to address this. 
Most treat each task episode as stateless, with no persistent memory across sessions [9, 15]."},{"citing_arxiv_id":"2604.15597","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLMs Corrupt Your Documents When You Delegate","primary_cat":"cs.CL","submitted_at":"2026-04-17T00:33:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03295","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-03-27T19:34:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20857","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory","primary_cat":"cs.CL","submitted_at":"2025-11-25T21:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn 
and reasoning datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21046","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","primary_cat":"cs.AI","submitted_at":"2025-07-28T17:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}