{"total":10,"items":[{"citing_arxiv_id":"2605.20833","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MemGym: a Long-Horizon Memory Environment for LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-20T07:25:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":205,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to become a useful asset for repository-level repair. 3.2.4. Long-Term Memory When coding trajectories become longer, working memory and semantic memory alone are insufficient, because the system must also cope with memory growth, compression-induced evidence distortion, and long-term drift. This makes long-term retrieval planning and memory control an increasingly important research direction [202, 203, 204, 205, 206]. The focus therefore shifts from memory capacity to memory governance. 
Representative systems such as MemGPT [207] and MemoryOS [208] move the discussion from what to store toward when to write, when to compress, when to retrieve, and how to avoid contamination. Recent code-centric studies further ground this line of work in software engineering workflows. MemCoder [190] leverages"},{"citing_arxiv_id":"2605.15710","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory","primary_cat":"cs.CL","submitted_at":"2026-05-15T08:00:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14498","ref_index":20,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations","primary_cat":"cs.CL","submitted_at":"2026-05-14T07:38:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13941","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM 
Agents","primary_cat":"cs.LG","submitted_at":"2026-05-13T17:12:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-benchmark transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13438","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CogniFold: Always-On Proactive Memory via Cognitive Folding","primary_cat":"cs.AI","submitted_at":"2026-05-13T12:34:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12260","ref_index":14,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRISM is a new inference-time retrieval system that achieves higher accuracy than baselines on long-horizon agent tasks while using an order of magnitude less context by combining hierarchical graph search, intent-based costing, compression, and adaptive routing over structured memory.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"than every same-protocol 
baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency. 1 Introduction In long-horizon agentic tasks involving multi-session conversations and lifelong assistants [14], the effectiveness of large language model (LLM) agents is fundamentally constrained by what information can be surfaced into the answer model's context at query time. Because the attention window is finite and degrades on long inputs [13], an external memory system is required to store past experience and to retrieve the relevant pieces when a new query arrives."},{"citing_arxiv_id":"2605.09330","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory","primary_cat":"cs.LG","submitted_at":"2026-05-10T05:04:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RQ2: How can agentic memory be calibrated to reduce influences of spurious correlations? This work. To answer these questions, this work makes the following contributions: I. Benchmarking and diagnosing spurious correlations in agentic memory Existing agentic-memory benchmarks evaluate retrieval accuracy or reasoning quality but do not account for spurious correlations [19, 38, 48, 59]. To address RQ1, we fill this gap with a new benchmark spanning four datasets. 
For each dataset, we recover a causal structure over the agent's observations and memory states, then identify variable pairs that are statistically associated yet lack any directed causal path between them. We target the discovery of three canonical types of spurious"},{"citing_arxiv_id":"2605.09278","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"checks whether new edges are consistent with the local relational neighborhood and with existing multi-hop reasoning paths. In both cases, entries are admitted with trust-discounted weights, so low-confidence content remains available but exerts less influence on downstream retrieval. We evaluate EQUIMEM on reasoning- and action-intensive benchmarks [39, 57, 87, 88] across three MAD frameworks [37, 49, 79] and four memory architectures [3, 69, 89, 94]. Three findings emerge. 
(i) State-aware calibration beats isolated checks: EQUIMEM ranks first on every benchmark-framework-memory configuration, confirming that errors missed by per-entry scoring are caught only when updates are evaluated against the global memory state."},{"citing_arxiv_id":"2605.01970","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration","primary_cat":"cs.CR","submitted_at":"2026-05-03T17:07:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}