MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
hub
Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks
22 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 22roles
background 3polarities
background 3representative citing papers
The supersession gap in LLM agents—failing to use current facts and discard superseded ones—is a distinct failure not fixed by scale or memory size, but improvable via RL training on a new environment.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
AgingBench demonstrates multi-dimensional degradation in deployed AI agents through four aging mechanisms diagnosed by temporal graphs and counterfactual probes across hundreds of runs.
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
MemDelta shows agent memory evaluations are confounded by LLM family and embedding model, with RAG often matching full context and self-memory underperforming basic retrieval.
Presents M³Exam benchmark for multimodal conversational memory in user-agent settings and M³Proctor method that raises accuracy 13% while cutting construction time and tokens over 70%.
MemTrace turns LLM memory operations into executable evolution graphs for error tracing, builds a benchmark across systems like RAG and Mem0, and uses attribution to optimize prompts, improving task performance by up to 7.62%.
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.
CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.
This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.
An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
citing papers explorer
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.