Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
29 Pith papers cite this work.
abstract
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).
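The four memory operations named in the abstract (ADD, UPDATE, DELETE, NOOP) can be sketched as a minimal management loop. In Memory-R1 the operation is chosen by an RL-fine-tuned LLM; below, a trivial heuristic stands in for that learned policy, and all names are illustrative rather than Memory-R1's actual interfaces:

```python
class MemoryBank:
    """Toy external memory: a mapping from a topic key to a stored fact."""

    def __init__(self):
        self.entries = {}

    def apply(self, op, key, fact=None):
        if op == "ADD":
            self.entries[key] = fact
        elif op == "UPDATE":
            self.entries[key] = fact      # overwrite with the newer fact
        elif op == "DELETE":
            self.entries.pop(key, None)
        elif op == "NOOP":
            pass                          # nothing worth changing
        else:
            raise ValueError(f"unknown operation: {op}")


def choose_op(bank, key, fact):
    """Stand-in for the learned Memory Manager policy."""
    if fact is None:
        return "DELETE" if key in bank.entries else "NOOP"
    if key not in bank.entries:
        return "ADD"
    if bank.entries[key] != fact:
        return "UPDATE"
    return "NOOP"


bank = MemoryBank()
stream = [
    ("pet", "Alice has a dog"),
    ("pet", "Alice has a dog"),          # duplicate -> NOOP
    ("pet", "Alice now has two dogs"),   # newer information -> UPDATE
]
for key, fact in stream:
    bank.apply(choose_op(bank, key, fact), key, fact)

print(bank.entries)  # {'pet': 'Alice now has two dogs'}
```

The point of learning this decision (rather than hard-coding it, as above) is that whether an incoming fact duplicates, refines, or contradicts stored memory is rarely decidable by string comparison.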
citing papers explorer
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
-
R^2-Mem: Reflective Experience for Memory Search
R^2-Mem distills rubric-scored experiences from high- and low-quality search trajectories to guide LLM agents, raising F1 by up to 22.6% while cutting tokens 12.9% and iterations 20.2%.
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency across multiple embodied benchmarks.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.
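The Noisy-OR rule mentioned in the summary above is a standard way to combine independent pieces of evidence: if each piece supports a conclusion with probability p_i, the combined belief is 1 minus the product of the (1 - p_i) terms. A minimal sketch (the function name is illustrative, not BeliefMem's actual API):

```python
def noisy_or(evidence_probs):
    """Combine independent evidence probabilities via the Noisy-OR rule."""
    none_fires = 1.0
    for p in evidence_probs:
        none_fires *= (1.0 - p)
    return 1.0 - none_fires


# Two weak observations jointly yield a stronger belief than either alone.
print(noisy_or([0.6]))       # 0.6
print(noisy_or([0.6, 0.5]))  # 0.8
```

This monotone accumulation is what lets a probabilistic memory keep several candidate conclusions alive and let repeated observations gradually promote one of them.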
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
SAGER: Self-Evolving User Policy Skills for Recommendation Agent
SAGER equips LLM recommendation agents with per-user evolving policy skills via a two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding state-of-the-art gains orthogonal to memory accumulation.
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and long-term agent benchmarks.
-
Tree-based Credit Assignment for Multi-Agent Memory System
TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.
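Monte Carlo averaging of final rewards, as the summary above describes, can be sketched in a few lines: each rollout is a path of node ids ending in a scalar reward, and a node's credit is the mean reward over all rollouts that passed through it. Names below are illustrative, not TreeMem's actual interfaces:

```python
from collections import defaultdict


def monte_carlo_credit(rollouts):
    """rollouts: list of (path, final_reward) pairs; returns per-node mean reward."""
    total = defaultdict(float)
    count = defaultdict(int)
    for path, reward in rollouts:
        for node in path:
            total[node] += reward
            count[node] += 1
    return {node: total[node] / count[node] for node in total}


rollouts = [
    (["root", "manager_A", "answer_1"], 1.0),
    (["root", "manager_A", "answer_2"], 0.0),
    (["root", "manager_B", "answer_3"], 1.0),
]
credit = monte_carlo_credit(rollouts)
print(credit["manager_A"])  # 0.5
print(credit["manager_B"])  # 1.0
```

Averaging over many sampled continuations is what converts a single end-of-trajectory reward into a per-agent training signal.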
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.
-
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
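A memory-admission router of the kind the summary above describes can be sketched as a small supervised classifier over fixed embeddings: a frozen encoder maps each candidate message to a vector, and logistic regression decides admit vs. skip. The `embed` featurizer below is a crude stand-in for a frozen-LLM encoder, and all names and data are illustrative, not MemRouter's actual pipeline:

```python
import math


def embed(text):
    # Stand-in for a frozen-LLM embedding: two hand-picked features.
    return [len(text) / 50.0, text.count("I ") / 5.0]


def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Plain gradient descent on the logistic log-loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = embed(text)
            g = predict(weights, bias, x) - label   # dLoss/dz
            weights = [w - lr * g * xi for w, xi in zip(weights, x)]
            bias -= lr * g
    return weights, bias


examples = [
    ("ok", 0), ("thanks!", 0),                           # chit-chat: skip
    ("I moved to Berlin last month for a new job", 1),   # durable fact: admit
    ("I am allergic to peanuts, please remember", 1),
]
weights, bias = train(examples)
print(predict(weights, bias, embed("I adopted a cat named Miso")) > 0.5)
```

Because the encoder stays frozen and only the tiny classifier head is trained, admission decisions cost one embedding lookup plus a dot product, which is where the latency advantage over an LLM-based memory manager comes from.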
-
Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
RSCB-MC is a risk-sensitive contextual bandit memory controller for LLM coding agents that chooses safe actions including abstention, achieving 60.5% proxy success with zero false positives and low latency in a 200-case validation.
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
-
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards
Introduces the MemHome benchmark and an RL approach with multi-dimensional rewards for memory-driven smart home device control.
-
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
-
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.
-
Decocted Experience Improves Test-Time Inference in LLM Agents
Decocted experience—extracting and organizing the essence from accumulated interactions—enables more effective context construction that improves test-time inference in LLM agents on math, web, and software tasks.
-
MemFactory: Unified Inference & Training Framework for Agent Memory
MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.
-
Reinforced Collaboration in Multi-Agent Flow Networks
MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.
-
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
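The four properties the summary above names (typed, addressable, versioned, dependency-aware) map naturally onto a small record type plus a lineage walk over the dependency graph. Field names and ids below are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str        # stable address for later retrieval
    kind: str               # type tag, e.g. "plan", "diff", "report"
    version: int            # bumped on every revision
    depends_on: tuple = ()  # ids of upstream artifacts
    payload: str = ""


def lineage(store, artifact_id, seen=None):
    """Walk dependencies to recover everything an artifact was built from."""
    seen = set() if seen is None else seen
    for dep in store[artifact_id].depends_on:
        if dep not in seen:
            seen.add(dep)
            lineage(store, dep, seen)
    return seen


store = {
    "plan:1": Artifact("plan:1", "plan", 1),
    "diff:1": Artifact("diff:1", "diff", 1, depends_on=("plan:1",)),
    "report:1": Artifact("report:1", "report", 1,
                         depends_on=("diff:1", "plan:1")),
}
print(sorted(lineage(store, "report:1")))  # ['diff:1', 'plan:1']
```

Making records immutable (`frozen=True`) and versioned means an inspector can always replay how a final artifact was assembled, which is the inspectability property the summary emphasizes.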
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
Improving Sparse Memory Finetuning
Sparse memory modules with KL-based surprising-token selection let retrofitted LLMs acquire new factual knowledge while largely preserving held-out capabilities.
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-reflection into LLM-based agents.