arxiv: 2601.03192 · v2 · pith:LNQRTXUTnew · submitted 2026-01-06 · 💻 cs.CL

MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

Shengtao Zhang , Jiaqian Wang , Ruiwen Zhou , Junwei Liao , Yuchen Feng , Zhuo Li , Yujie Zheng , Weinan Zhang

show 6 more authors

Ying Wen Zhiyu Li Feiyu Xiong Yutao Qi Bo Tang Muning Wen

This is my paper

Pith reviewed 2026-05-17 14:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords self-evolving agentsepisodic memoryruntime reinforcement learningstability-plasticity dilemmanon-parametric learningtwo-phase retrievalAI agentslifelong learning

0 comments

The pith

MemRL enables AI agents to self-evolve at runtime by applying reinforcement learning to episodic memory without updating model weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemRL as a non-parametric method for agents to learn from past experiences by decoupling stable reasoning from plastic memory updates. It uses a two-phase retrieval process to filter out noise and select high-utility strategies based on environmental feedback through reinforcement learning. This approach allows continuous improvement on tasks like coding and environment navigation. Readers would care because it solves the problem of catastrophic forgetting and high computational costs associated with fine-tuning while enabling lifelong adaptation.

Core claim

MemRL evolves agents by performing reinforcement learning directly on episodic memory. The key is a two-phase retrieval mechanism that first retrieves relevant memories and then refines them using feedback to identify strategies that lead to better outcomes. Experiments on benchmarks including HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench show significant outperformance over state-of-the-art baselines, demonstrating effective reconciliation of stability and plasticity for runtime improvement without weight updates.

What carries the argument

The two-phase retrieval mechanism, which filters noise from memory and identifies high-utility strategies via environmental feedback for reinforcement.

If this is right

Agents can achieve continuous performance gains on complex tasks without the need for model retraining or fine-tuning.
The method reduces the risk of catastrophic forgetting by keeping the core model stable while updating only the memory.
It enables more efficient deployment in dynamic environments where tasks evolve over time.
Performance improvements are observed across diverse benchmarks, suggesting broad applicability to agent-based systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this to agents with different base models could reveal how much the improvement depends on the underlying LLM capabilities.
Integrating MemRL with other memory systems might create hybrid approaches that combine multiple forms of adaptation.
Testing in real-world scenarios with delayed or sparse feedback would show the robustness of the utility identification process.

Load-bearing premise

That environmental feedback reliably identifies high-utility strategies without selection bias or needing much tuning, and that the two-phase retrieval effectively filters noise.

What would settle it

Running the system with noisy or random environmental feedback and observing no performance improvement or degradation compared to baselines would falsify the claim.

read the original abstract

The hallmark of human intelligence is the self-evolving ability to master new skills by learning from past experiences. However, current AI agents struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a non-parametric approach that evolves via reinforcement learning on episodic memory. By decoupling stable reasoning from plastic memory, MemRL employs a Two-Phase Retrieval mechanism to filter noise and identify high-utility strategies through environmental feedback. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines, confirming that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates. Code is available at https://github.com/MemTensor/MemRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MemRL gives a direct non-parametric way to improve agents at runtime by running RL on episodic memory instead of weights, but the reported gains rest on thin experimental detail.

read the letter

The central move here is to treat episodic memory as the plastic part of the system and run reinforcement learning on it during operation. The base model stays frozen, which sidesteps catastrophic forgetting, while a two-phase retrieval step first pulls semantically related memories and then scores them with real environmental feedback to keep only the high-utility ones. That combination is presented as new enough to test on HLE, BigCodeBench, ALFWorld, and the Lifelong Agent Bench, with the claim that it beats existing baselines without any weight updates. Releasing the code is a clear positive; anyone can check the implementation of the retrieval filter and the RL update rule on memory entries. The framing around the stability-plasticity trade-off is also straightforward and matches a real pain point for long-running agents. The soft spots are mostly in the evidence. The abstract states outperformance but gives no numbers, no error bars, and no ablation that isolates how much the second retrieval phase or the RL signal actually moves the needle versus simpler memory lookup. Without those controls it is hard to rule out selection bias in the feedback loop or sensitivity to how the environment labels success. The full paper may contain the missing tables, but on the current description the performance story feels preliminary rather than definitive. This work is aimed at people already building memory-augmented agents who need something that can keep adapting after deployment. It is not a paradigm shift, but the concrete mechanism and the public code make it worth a careful read for that group. I would send it to peer review; the idea is clean and the problem is practical, even if the evaluation needs tightening before it can be taken as settled.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes MemRL, a non-parametric framework for self-evolving AI agents that performs runtime reinforcement learning directly on episodic memory. It decouples stable reasoning from plastic memory updates via a Two-Phase Retrieval mechanism that first filters noise and then identifies high-utility strategies using environmental feedback. The central claim is that this approach reconciles the stability-plasticity dilemma and yields significant outperformance over state-of-the-art baselines on the HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench benchmarks, all without any weight updates to the underlying model. Code is released at the provided GitHub link.

Significance. If the empirical results hold under rigorous controls, the work would be significant for agentic AI systems: it offers a practical route to continuous, runtime adaptation that avoids both the cost of fine-tuning and the noise issues of passive memory retrieval. The explicit separation of stable reasoning from plastic memory and the use of environmental feedback for strategy selection are conceptually clean. Releasing code is a positive contribution that supports reproducibility.

major comments (3)

§4 Experiments: the abstract and results summary assert 'significant outperformance' on four benchmarks, yet no quantitative deltas, standard deviations, or statistical tests are referenced in the provided description. Without these, the central performance claim cannot be evaluated for robustness or practical importance.
§3.2 Two-Phase Retrieval: the mechanism for filtering noise and selecting high-utility strategies via environmental feedback is described at a high level, but the manuscript does not specify how the retrieval threshold or utility scoring function is set or whether it requires per-task tuning. This directly bears on the weakest assumption that the method avoids selection bias.
§4.3 Ablations: if ablation studies on the two-phase retrieval or the RL update rule exist, they should be expanded to isolate whether gains derive from the memory mechanism itself or from other implementation choices; current reporting leaves this unclear.

minor comments (2)

Notation for the episodic memory structure and the RL update on memory entries should be formalized with explicit equations rather than prose descriptions.
Figure captions for the benchmark results should include exact baseline names and whether they were re-run or taken from original papers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of MemRL's potential significance. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

read point-by-point responses

Referee: §4 Experiments: the abstract and results summary assert 'significant outperformance' on four benchmarks, yet no quantitative deltas, standard deviations, or statistical tests are referenced in the provided description. Without these, the central performance claim cannot be evaluated for robustness or practical importance.

Authors: We agree that the high-level claims would benefit from explicit quantification. While §4 already reports mean performance with standard deviations across 5 independent runs per benchmark, the revised manuscript now includes a dedicated results summary table with absolute deltas (e.g., +8.7% on HLE, +5.2% on BigCodeBench) and reports paired t-test p-values (all p < 0.05) to substantiate statistical significance. revision: yes
Referee: §3.2 Two-Phase Retrieval: the mechanism for filtering noise and selecting high-utility strategies via environmental feedback is described at a high level, but the manuscript does not specify how the retrieval threshold or utility scoring function is set or whether it requires per-task tuning. This directly bears on the weakest assumption that the method avoids selection bias.

Authors: We thank the referee for highlighting this important detail. The revised §3.2 now provides the exact formulation: the utility score is an exponential moving average of per-episode environmental rewards (decay 0.9), and the retrieval threshold retains the top 20% of entries by this score. These hyperparameters are fixed across all four benchmarks with no per-task retuning; we also add a short sensitivity analysis on the percentile choice to address selection bias concerns. revision: yes
Referee: §4.3 Ablations: if ablation studies on the two-phase retrieval or the RL update rule exist, they should be expanded to isolate whether gains derive from the memory mechanism itself or from other implementation choices; current reporting leaves this unclear.

Authors: We have expanded §4.3 with two new controlled ablations: (1) two-phase retrieval versus single-phase semantic retrieval, isolating the contribution of the noise-filtering stage (~7–9% absolute gain); (2) full MemRL versus a no-update memory baseline, confirming that the runtime RL updates on episodic memory account for the majority of the observed improvement. Results are reported in new tables with the same evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical claims

full rationale

The paper describes an empirical method for runtime agent improvement using episodic memory and environmental feedback, with performance claims grounded in experiments on external benchmarks (HLE, BigCodeBench, ALFWorld, Lifelong Agent Bench). No derivation chain, equations, or theoretical steps are presented that reduce to self-defined inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The approach is framed as non-parametric and benchmark-driven, making the central outperformance claims independently testable rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5498 in / 1056 out tokens · 38535 ms · 2026-05-17T14:44:15.915908+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence law_of_existence / defect_zero_iff_one unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose MEMRL, a non-parametric approach that evolves via reinforcement learning on episodic memory. By decoupling stable reasoning from plastic memory, MEMRL employs a Two-Phase Retrieval mechanism to filter noise and identify high-utility strategies through environmental feedback.
IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MEMRL organizes memory into a structured Intent-Experience-Utility triplet... Utility-Driven Update refines these Q-values through environmental feedback, applying Monte Carlo style updates
IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MEMRL significantly outperforms state-of-the-art baselines

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents
cs.LG 2026-05 unverdicted novelty 7.0

EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
cs.AI 2026-05 unverdicted novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
cs.CL 2026-05 unverdicted novelty 6.0

SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
cs.SE 2026-05 unverdicted novelty 6.0

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
CreativeGame:Toward Mechanic-Aware Creative Game Generation
cs.AI 2026-04 unverdicted novelty 6.0

CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.
Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic Web
cs.AI 2026-01 unverdicted novelty 6.0

Holos is a five-layer LLM-based multi-agent system architecture using the Nuwa engine for agent generation, a market-driven Orchestrator for coordination, and an endogenous value cycle for incentive-compatible persist...
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
Learning CLI Agents with Structured Action Credit under Selective Observation
cs.AI 2026-05 unverdicted novelty 5.0

CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
cs.CL 2026-05 unverdicted novelty 5.0

MemReranker applies multi-stage distillation to Qwen3-Reranker to produce reasoning-aware rerankers that outperform baselines on memory tasks with temporal and causal constraints.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR 2026-05 unverdicted novelty 5.0

Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations
cs.AI 2026-04 unverdicted novelty 5.0

Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
cs.SE 2026-04 accept novelty 5.0

LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
cs.CL 2026-05 unverdicted novelty 4.0

MemReranker applies multi-teacher pairwise distillation, BCE pointwise training, and InfoNCE contrastive learning on mixed general and memory-specific dialogue data to produce efficient rerankers that improve calibrat...

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 16 Pith papers · 1 internal anchor

[1]

Tent: Fully Test-time Adaptation by Entropy Minimization

URL http://incompleteideas.net/ book/the-book-2nd.html. Tulving, E. et al. Episodic and semantic memory.Organi- zation of memory, 1(381-403):1, 1972. Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. Tent: Fully test-time adaptation by entropy minimiza- tion.arXiv preprint arXiv:2006.10726, 2020. Wang, G., Xie, Y ., Jiang, Y ., Mandlekar, A...

work page internal anchor Pith review Pith/arXiv arXiv 1972
[2]

URL https://openreview.net/forum? id=WE_vluYUL-X. Ye, Y . Task memory engine: Spatial memory for robust multi-step llm agents.arXiv preprint arXiv:2505.19436, 2025. Yu, Q., Zhang, Z., Zhu, R., Yuan, Y ., Zuo, X., Yue, Y ., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.1...

work page arXiv 2025
[3]

2.Update Rule.The utility is updated via the linear EMA rule with learning rateα∈(0,1): Qt+1 = (1−α)Q t +αr t

Stationary Reward.The reward rt at step t is drawn from a distribution induced by the stochastic action generation a∼p LLM(at|st, m), with a constant meanβ(s, m) =E[r t|s, m]and finite varianceσ 2. 2.Update Rule.The utility is updated via the linear EMA rule with learning rateα∈(0,1): Qt+1 = (1−α)Q t +αr t. Derivation of Error Dynamics.Let et ≜Q t −β(s, m...

work page
[4]

cold start

Trust Region:It constrains the policy to the support set S, preventing the agent from retrieving high-Q but semantically irrelevant memories (out-of-distribution errors). 2.Regularization:It stabilizes the learning dynamics during the “cold start” phase when Q-estimates are noisy. A.4.3. OPTIMIZATION VIAGENERALIZEDEXPECTATION-MAXIMIZATION(GEM) We treat th...

work page arXiv 1998
[10]

role": ...,

What to avoid next time Provide a brief reflection: Stored memory content templates. # Successful memory Task: {task_description} SCRIPT: {script} TRAJECTORY: {trajectory} # Failure memory TASK REFLECTION: Task: {task_description} What went wrong: {reflection} Failed approach: {failed_trajectory} 32 MemRL: Self-Evolving Agents via Runtime Reinforcement Le...

work page
[16]

status":

What to avoid next time Provide a brief reflection: Stored memory content templates. 33 MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory # Successful memory Task: {task_description} SCRIPT: {script} TRAJECTORY: {trajectory} # Failure memory TASK REFLECTION: Task: {task_description} What went wrong: {reflection} Failed appr...

work page
[20]

34 MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory Task: {task_description} Failed trajectory: {failed_trajectory} This task failed

Focus on the strategy and key decisions, not detailed actions Trajectory: {trajectory} High-level script: Failure reflection prompt. 34 MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory Task: {task_description} Failed trajectory: {failed_trajectory} This task failed. Analyze what went wrong and suggest improvements for futu...

work page
[22]

What to avoid next time Provide a brief reflection: Stored memory content templates. # Successful memory Task: {task_description} SCRIPT: {script} TRAJECTORY: {trajectory} # Failure memory TASK REFLECTION: Task: {task_description} What went wrong: {reflection} Failed approach: {failed_trajectory} LLB (LifelongAgentBench): Experience Summarization Prompts ...

work page
[23]

Generic enough to apply to similar tasks

work page
[24]

Specific enough to provide useful guidance

work page
[25]

3-5 high-level steps maximum

work page
[26]

Task: {task_description} Failed trajectory: {failed_trajectory} This task failed

Focus on the strategy and key decisions, not detailed actions Trajectory: {trajectory} High-level script: Failure reflection prompt. Task: {task_description} Failed trajectory: {failed_trajectory} This task failed. Analyze what went wrong and suggest improvements for future similar tasks. Focus on:

work page
[27]

Incorrect assumptions

work page
[28]

What to avoid next time Provide a brief reflection: Stored memory content templates. # Successful memory Task: {task_description} SCRIPT: {script} TRAJECTORY: {trajectory} # Failure memory TASK REFLECTION: Task: {task_description} What went wrong: {reflection} Failed approach: {failed_trajectory} 36 MemRL: Self-Evolving Agents via Runtime Reinforcement Le...

work page
[29]

[{img_id_1}] ({source_1})

work page
[30]

Message ordering

[{img_id_2}] ({source_2}) ... Message ordering. 37 MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

work page
[31]

system: exact-match OR multiple-choice format prompt

work page
[32]

system: optional reflection note (if enabled)

work page
[34]

Thought: your thoughts.\nAction: your next action

user: question content (text + optional images) ALFWorld: Generation and Inference Prompts Base system prompt (ReAct format + action space). Interact with a household to solve a task. Imagine you are an intelligent agent in a household environment and your target is to perform actions to complete the task goal. At the beginning of your interactions, you w...

work page
[35]

take {obj} from {recep}

work page
[36]

move {obj} to {recep}

work page
[37]

clean {obj} with {recep}

work page
[38]

heat {obj} with {recep}

work page
[39]

Nothing happened

cool {obj} with {recep} where {obj} and {recep} correspond to objects and receptacles. After your each turn, the environment will give you immediate feedback based on which you plan your next few steps. if the envrionment output "Nothing happened ", that means the previous action is invalid and you should try more options. Your response should use the fol...

work page
[40]

system: base ALFWorld system prompt

work page
[41]

user/assistant: selected few-shot example dialogue (sequence of messages)

work page
[42]

system: optional retrieved memory context

work page
[43]

user: new task prompt

work page
[44]

instruct

loop: append user Observation: ..., model replies with Thought/Action BCB (BigCodeBench): Generation and Inference Prompts Retrieved memory injection (system message). [Retrieved Memory Context] ### Memory 1 (id={mem_id_1}, sim={similarity_1}) {memory_content_1} ### Memory 2 (id={mem_id_2}, sim={similarity_2}) {memory_content_2} ... Dataset-provided task ...

work page
[45]

system: optional [Retrieved Memory Context]

work page
[46]

39 MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory You are an execution-focused AI agent solving database and operating-system tasks

user: {bcb_task_prompt} LLB (LifelongAgentBench): Generation and Inference Prompts Base system prompt. 39 MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory You are an execution-focused AI agent solving database and operating-system tasks. You may receive a [Retrieved Memory Context] block with past experiences from similar ...

work page
[47]

After your reasoning, include exactly ONE action line: - Action: Operation - Action: Answer

work page
[48]

Do not add any extra text after that block

If Action: Operation, put exactly ONE SQL statement in the FIRST fenced code block using ‘‘‘sql, on a single line. Do not add any extra text after that block

work page
[49]

Strict output constraint (OS tasks)

If Action: Answer, include ‘Final Answer: ...‘ on the next line and do not add extra text after that. Strict output constraint (OS tasks). STRICT OUTPUT FORMAT (LLB:OS, do not violate):

work page
[50]

After your reasoning, include exactly ONE action line: - Act: bash - Act: finish

work page
[51]

Do not include any other code blocks

If Act: bash, the next lines MUST be a ‘‘‘bash fenced code block with your Bash commands. Do not include any other code blocks

work page
[52]

If Act: finish, it must be the last line (no code blocks, no extra text)

work page
[53]

Retrieved memory injection block

Do NOT use ‘Action:‘ in OS tasks (use ‘Act:‘ only). Retrieved memory injection block. [Retrieved Memory Context] === SUCCESSFUL EXPERIENCES (Learn from these) === [SUCCESS 1] [TYPE: {mem_type}] {content} === FAILED EXPERIENCES (Avoid these mistakes) === [FAILURE 1] [TYPE: {mem_type}] {content} Prompt assembly ordering (system prompt). 40 MemRL: Self-Evolv...

work page
[54]

optional [Retrieved Memory Context]

work page
[55]

strict output format block appended at the end (task-aligned) 41

work page