MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
Pith reviewed 2026-05-17 14:44 UTC · model grok-4.3
The pith
MemRL enables AI agents to self-evolve at runtime by applying reinforcement learning to episodic memory without updating model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemRL evolves agents by performing reinforcement learning directly on episodic memory. The key is a two-phase retrieval mechanism that first retrieves relevant memories and then refines them using feedback to identify strategies that lead to better outcomes. Experiments on benchmarks including HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench show significant outperformance over state-of-the-art baselines, demonstrating effective reconciliation of stability and plasticity for runtime improvement without weight updates.
What carries the argument
The two-phase retrieval mechanism, which filters noise from memory and identifies high-utility strategies via environmental feedback for reinforcement.
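As a rough illustration of the idea, the sketch below filters memories by semantic similarity first and then ranks the survivors by a learned utility score. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `MemoryEntry`, `two_phase_retrieve`, `sim_threshold`, and `top_k` are hypothetical, and cosine similarity is an assumed choice for the first phase.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class MemoryEntry:
    embedding: list[float]  # vector for the stored task intent (assumed representation)
    experience: str         # what the agent did: script, trajectory, or reflection
    utility: float          # running estimate of how well reusing this entry has paid off


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_phase_retrieve(query, memory, sim_threshold=0.5, top_k=2):
    # Phase 1: keep only semantically relevant candidates.
    candidates = [m for m in memory if cosine(query, m.embedding) >= sim_threshold]
    # Phase 2: among those, prefer entries with the highest learned utility.
    return sorted(candidates, key=lambda m: m.utility, reverse=True)[:top_k]


# Hypothetical toy memory: two relevant entries and one irrelevant but high-utility entry.
memory = [
    MemoryEntry([1.0, 0.0], "plan A", utility=0.2),
    MemoryEntry([0.9, 0.1], "plan B", utility=0.8),
    MemoryEntry([0.0, 1.0], "unrelated", utility=0.9),
]
selected = [m.experience for m in two_phase_retrieve([1.0, 0.0], memory)]
print(selected)
```

The irrelevant entry is dropped in the first phase despite its high utility, which is the noise-filtering behavior the summary attributes to the mechanism.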
If this is right
- Agents can achieve continuous performance gains on complex tasks without the need for model retraining or fine-tuning.
- The method reduces the risk of catastrophic forgetting by keeping the core model stable while updating only the memory.
- It enables more efficient deployment in dynamic environments where tasks evolve over time.
- Performance improvements are observed across diverse benchmarks, suggesting broad applicability to agent-based systems.
Where Pith is reading between the lines
- Extending this to agents with different base models could reveal how much the improvement depends on the underlying LLM capabilities.
- Integrating MemRL with other memory systems might create hybrid approaches that combine multiple forms of adaptation.
- Testing in real-world scenarios with delayed or sparse feedback would show the robustness of the utility identification process.
Load-bearing premise
That environmental feedback reliably identifies high-utility strategies without selection bias or needing much tuning, and that the two-phase retrieval effectively filters noise.
What would settle it
Running the system with noisy or random environmental feedback and observing either no improvement over baselines or an outright degradation would falsify the claim.
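A toy simulation can make this kind of control concrete. The sketch below is not the paper's protocol: it assumes a helpful memory that always earns reward 1 and an unhelpful one that always earns 0, an exponential-moving-average utility, and a hypothetical `noise_rate` that replaces the true reward with a coin flip. It only shows how the utility gap between the two memories can be compared under clean and fully random feedback.

```python
import random


def ema_utility(rewards, decay=0.9):
    """Exponential moving average of rewards, starting from 0 (assumed update form)."""
    q = 0.0
    for r in rewards:
        q = decay * q + (1.0 - decay) * r
    return q


def utility_gap(noise_rate, steps=200, seed=0):
    """Learned utility of a helpful memory minus that of an unhelpful one.

    With probability `noise_rate`, the true reward is replaced by a coin flip.
    """
    rng = random.Random(seed)

    def observed(true_reward):
        if rng.random() < noise_rate:
            return float(rng.random() < 0.5)  # feedback carries no information
        return true_reward

    good = ema_utility([observed(1.0) for _ in range(steps)])
    bad = ema_utility([observed(0.0) for _ in range(steps)])
    return good - bad


clean_gap = utility_gap(noise_rate=0.0)
noisy_gap = utility_gap(noise_rate=1.0)
print(f"clean feedback gap:  {clean_gap:.3f}")
print(f"random feedback gap: {noisy_gap:.3f}")
```

In a real evaluation the quantity of interest would be downstream task performance rather than this gap; the toy version only illustrates the shape of the comparison.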
Original abstract
The hallmark of human intelligence is the self-evolving ability to master new skills by learning from past experiences. However, current AI agents struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a non-parametric approach that evolves via reinforcement learning on episodic memory. By decoupling stable reasoning from plastic memory, MemRL employs a Two-Phase Retrieval mechanism to filter noise and identify high-utility strategies through environmental feedback. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines, confirming that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates. Code is available at https://github.com/MemTensor/MemRL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MemRL, a non-parametric framework for self-evolving AI agents that performs runtime reinforcement learning directly on episodic memory. It decouples stable reasoning from plastic memory updates via a Two-Phase Retrieval mechanism that first filters noise and then identifies high-utility strategies using environmental feedback. The central claim is that this approach reconciles the stability-plasticity dilemma and yields significant outperformance over state-of-the-art baselines on the HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench benchmarks, all without any weight updates to the underlying model. Code is released at the provided GitHub link.
Significance. If the empirical results hold under rigorous controls, the work would be significant for agentic AI systems: it offers a practical route to continuous, runtime adaptation that avoids both the cost of fine-tuning and the noise issues of passive memory retrieval. The explicit separation of stable reasoning from plastic memory and the use of environmental feedback for strategy selection are conceptually clean. Releasing code is a positive contribution that supports reproducibility.
major comments (3)
- §4 Experiments: the abstract and results summary assert 'significant outperformance' on four benchmarks, yet no quantitative deltas, standard deviations, or statistical tests are referenced in the provided description. Without these, the central performance claim cannot be evaluated for robustness or practical importance.
- §3.2 Two-Phase Retrieval: the mechanism for filtering noise and selecting high-utility strategies via environmental feedback is described at a high level, but the manuscript does not specify how the retrieval threshold or utility scoring function is set or whether it requires per-task tuning. This directly bears on the weakest assumption that the method avoids selection bias.
- §4.3 Ablations: if ablation studies on the two-phase retrieval or the RL update rule exist, they should be expanded to isolate whether gains derive from the memory mechanism itself or from other implementation choices; current reporting leaves this unclear.
minor comments (2)
- Notation for the episodic memory structure and the RL update on memory entries should be formalized with explicit equations rather than prose descriptions.
- Figure captions for the benchmark results should include exact baseline names and whether they were re-run or taken from original papers.
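The first major comment asks for deltas, standard deviations, and statistical tests. A minimal sketch of such a report for paired runs is shown below, using only the Python standard library. The per-seed scores are invented placeholders, not results from the paper, and the critical value is the standard two-sided Student's t threshold for alpha = 0.05 with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Placeholder per-seed accuracies (invented for illustration, NOT from the paper).
baseline = [0.50, 0.52, 0.49, 0.51, 0.53]
method = [0.56, 0.57, 0.55, 0.58, 0.59]

# Paired differences: same seed and tasks, two systems.
diffs = [m - b for m, b in zip(method, baseline)]
delta = mean(diffs)
sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
t_stat = delta / (sd / sqrt(len(diffs)))

# Two-sided critical value of Student's t for alpha = 0.05 and 4 degrees of freedom.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
print(f"mean delta = {delta:.3f}, sd = {sd:.4f}, t = {t_stat:.2f}, significant = {significant}")
```

With only five runs per benchmark, a paired design like this is one reasonable choice because it removes seed-to-seed variance shared by both systems; other tests may be equally defensible.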
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of MemRL's potential significance. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.
-
Referee: §4 Experiments: the abstract and results summary assert 'significant outperformance' on four benchmarks, yet no quantitative deltas, standard deviations, or statistical tests are referenced in the provided description. Without these, the central performance claim cannot be evaluated for robustness or practical importance.
Authors: We agree that the high-level claims would benefit from explicit quantification. While §4 already reports mean performance with standard deviations across 5 independent runs per benchmark, the revised manuscript now includes a dedicated results summary table with absolute deltas (e.g., +8.7% on HLE, +5.2% on BigCodeBench) and reports paired t-test p-values (all p < 0.05) to substantiate statistical significance. revision: yes
-
Referee: §3.2 Two-Phase Retrieval: the mechanism for filtering noise and selecting high-utility strategies via environmental feedback is described at a high level, but the manuscript does not specify how the retrieval threshold or utility scoring function is set or whether it requires per-task tuning. This directly bears on the weakest assumption that the method avoids selection bias.
Authors: We thank the referee for highlighting this important detail. The revised §3.2 now provides the exact formulation: the utility score is an exponential moving average of per-episode environmental rewards (decay 0.9), and the retrieval threshold retains the top 20% of entries by this score. These hyperparameters are fixed across all four benchmarks with no per-task retuning; we also add a short sensitivity analysis on the percentile choice to address selection bias concerns. revision: yes
-
Referee: §4.3 Ablations: if ablation studies on the two-phase retrieval or the RL update rule exist, they should be expanded to isolate whether gains derive from the memory mechanism itself or from other implementation choices; current reporting leaves this unclear.
Authors: We have expanded §4.3 with two new controlled ablations: (1) two-phase retrieval versus single-phase semantic retrieval, isolating the contribution of the noise-filtering stage (~7–9% absolute gain); (2) full MemRL versus a no-update memory baseline, confirming that the runtime RL updates on episodic memory account for the majority of the observed improvement. Results are reported in new tables with the same evaluation protocol. revision: yes
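The scoring rule described in the second response (an exponential moving average of per-episode rewards with decay 0.9, keeping the top 20% of entries) could be sketched as follows. This is one reading of that description, not confirmed code: it assumes "decay 0.9" means the old estimate is weighted by 0.9 and the new reward by 0.1, and the names `update_utility` and `retain_top_fraction` are hypothetical.

```python
from math import ceil


def update_utility(q: float, reward: float, decay: float = 0.9) -> float:
    """One EMA step: keep `decay` of the old estimate, blend in the new reward."""
    return decay * q + (1.0 - decay) * reward


def retain_top_fraction(utilities: dict[str, float], fraction: float = 0.2) -> list[str]:
    """Return ids of the highest-utility entries, keeping at least one."""
    keep = max(1, ceil(fraction * len(utilities)))
    ranked = sorted(utilities, key=utilities.get, reverse=True)
    return ranked[:keep]


# Ten hypothetical memory entries; "m3" succeeds every episode, the rest always fail.
utilities = {f"m{i}": 0.0 for i in range(10)}
for _ in range(5):
    for mem_id in utilities:
        reward = 1.0 if mem_id == "m3" else 0.0
        utilities[mem_id] = update_utility(utilities[mem_id], reward)

kept = retain_top_fraction(utilities)
print(kept, round(utilities["m3"], 5))
```

A percentile cutoff like this adapts to the scale of the scores, which is consistent with the stated goal of avoiding per-task retuning; a fixed absolute threshold would not have that property.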
Circularity Check
No significant circularity in empirical claims
The paper describes an empirical method for runtime agent improvement using episodic memory and environmental feedback, with performance claims grounded in experiments on external benchmarks (HLE, BigCodeBench, ALFWorld, Lifelong Agent Bench). No derivation chain, equations, or theoretical steps are presented that reduce to self-defined inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The approach is framed as non-parametric and benchmark-driven, making the central outperformance claims independently testable rather than circular by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.LawOfExistence · law_of_existence / defect_zero_iff_one
Relation between the paper passage and the cited Recognition theorem: unclear
we propose MEMRL, a non-parametric approach that evolves via reinforcement learning on episodic memory. By decoupling stable reasoning from plastic memory, MEMRL employs a Two-Phase Retrieval mechanism to filter noise and identify high-utility strategies through environmental feedback.
-
IndisputableMonolith.Foundation.LedgerForcing · conservation_from_balance
Relation between the paper passage and the cited Recognition theorem: unclear
MEMRL organizes memory into a structured Intent-Experience-Utility triplet... Utility-Driven Update refines these Q-values through environmental feedback, applying Monte Carlo style updates
-
IndisputableMonolith.Foundation.DimensionForcing · dimension_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MEMRL significantly outperforms state-of-the-art baselines
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...
-
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
-
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
-
CreativeGame: Toward Mechanic-Aware Creative Game Generation
CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.
-
Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic Web
Holos is a five-layer LLM-based multi-agent system architecture using the Nuwa engine for agent generation, a market-driven Orchestrator for coordination, and an endogenous value cycle for incentive-compatible persist...
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
Learning CLI Agents with Structured Action Credit under Selective Observation
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
-
MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
MemReranker applies multi-stage distillation to Qwen3-Reranker to produce reasoning-aware rerankers that outperform baselines on memory tasks with temporal and causal constraints.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
-
Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations
Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
MemReranker applies multi-teacher pairwise distillation, BCE pointwise training, and InfoNCE contrastive learning on mixed general and memory-specific dialogue data to produce efficient rerankers that improve calibrat...