MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
hub Canonical reference
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).
hub tools
citation-role summary
citation-polarity summary
roles
background 10polarities
background 10representative citing papers
User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.
Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.
MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.
ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.
Introduces PerMemBench benchmark for personalized memory and shows session-level gating yields retention gains under perfect decisions but accurate gating is an open challenge.
MemMark enables snapshot-only attribution for agent long-term memory by embedding signals via keyed distribution-preserving sampling at memory-write decisions, recovering 40-bit payloads with near-baseline utility.
EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-benchmark transfer.
R^2-Mem distills rubric-scored experiences from high- and low-quality search trajectories to guide LLM agents, raising F1 by up to 22.6% while cutting tokens 12.9% and iterations 20.2%.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to memory accumulation.
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
AutoMem automates memory structure revision and proficiency training in LLMs, delivering 2x-4x performance gains on long-horizon games without altering task-action behavior.
ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.
ProReviewer is an MDP-formulated proactive peer review agent trained with SFT and RL on an 8B model that outperforms larger frontier LLMs on review quality metrics.
TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.
HMARS introduces a hierarchical multi-agent memory system that outperforms standard retrieval and other baselines on long-document and multi-turn reasoning tasks through improved evidence coverage.
RefMem-Bench benchmarks reflective memory in dialogue with 26K instances across eight dimensions, and REMIND improves model accuracy via hierarchical evidence retrieval, grounding, and abstraction.
citing papers explorer
-
ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems
ConMem distills agent trajectories into structured memory cards organized in a relation-aware graph to enable training-free, relation-coordinated adaptation in LLM-based multi-agent systems.
-
EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents
EMBER learns to retain source-backed evidence capsules under a fixed token budget, improving F1, Retain-Recall, and Read-Recall on LongMemEval-RR over budgeted baselines.
-
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.
-
SaliMory: Orchestrating Cognitive Memory for Conversational Agents
SALIMORY trains an LM to orchestrate cognitive memory operations via stage-wise process rewards, cutting memory failures by one-third and more than doubling good personalization rates.
-
Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory
PPRO improves user-aware memory retrieval in conversational agents by using derived user profiles for ranking and training a query rewriter via Group Relative Policy Optimization, with reported gains on LoCoMo and LongMemEval-S benchmarks.
-
AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents
AlphaMemo equips LLM alpha-mining agents with AST-diff motif memory, residual learning, and asymmetric veto control to improve out-of-sample factor discovery on CSI 500 and S&P 500.
-
Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory
Proposes Governed Evolving Memory (GEM) as a state-trajectory workload for long-term AI agent memory using four operators and six correctness conditions that record-level systems cannot satisfy.
-
Dynamic Mixture of Latent Memories for Self-Evolving Agents
MoLEM achieves a 10.40% average accuracy improvement in continual learning tasks across math, science, and code by using dynamic latent memory experts with a frozen base model and stage-specific autoencoders for routing.
-
Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.
-
PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning
PyraVid is a hierarchical multimodal memory system that structures long videos into pyramids to improve long-horizon reasoning and evidence aggregation.
-
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
DimMem introduces typed dimensional memory units that improve accuracy to 81.43% and 78.20% on two long-term agent benchmarks while cutting token cost by 24% and enabling small models to match larger extractors.
-
Improving Multi-turn Dialogue Consistency with Self-Recall Thinking
SRT framework improves multi-turn dialogue F1 by 4.7% and cuts end-to-end latency by 14.7% via dependency construction, capability initialization, and reasoning improvement with recall tokens.
-
Reinforced Collaboration in Multi-Agent Flow Networks
MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.
-
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
TRUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory
TrustMem introduces a verifier for memory update transitions and preference-guided RL to cut omission, corruption, and hallucination rates in LLM agent memory while reaching SOTA on MemoryAgentBench and HaluMem.
-
Toward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines
Introduces a migration framework and A/B/C taxonomy to convert static expert LLM workflows into self-evolution-ready harnesses.
-
Improving Sparse Memory Finetuning
Sparse memory modules with KL-based surprising-token selection let retrofitted LLMs acquire new factual knowledge while largely preserving held-out capabilities.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
-
Rethinking Agentic Reinforcement Learning In Large Language Models
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
- Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents