pith. sign in

arxiv: 2604.20572 · v2 · pith:FW7ADB3Rnew · submitted 2026-04-22 · 💻 cs.CL

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Pith reviewed 2026-05-10 00:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords lifelong learningproactive retrievalexperience basereinforcement learningagent memoryretrieval policyonline evolution
0
0 comments X

The pith

Lifelong agents learn an explicit policy for retrieving past experience only when it improves the next decision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current agents retrieve memories and skills too passively, often at fixed points like task start. ProactAgent instead treats retrieval itself as a learnable action inside a reinforcement-learning loop. By running paired branches from the same interaction prefix—one with retrieval and one without—the system supplies direct step-level reward signals that teach the agent when retrieval is worth the cost. This produces both higher task success and lower retrieval overhead across three environments. The result is an experience base that agents refine online while deciding on the fly whether to consult it.

Core claim

ProactAgent organizes past interactions into factual memory, episodic memory, and behavioral skills, then trains a retrieval policy through Proactive Reinforcement Learning-based Retrieval (ProactRL). ProactRL compares two continuations that start from the identical state: one branch receives retrieved content and the other does not. The difference in eventual task outcome or efficiency supplies the reward that updates the retrieval decision. Combined with Experience-Enhanced Online Evolution that updates both the main policy and the memory store, the framework yields success rates of 73.50 percent on SciWorld and 71.28 percent on AlfWorld while cutting retrieval calls.

What carries the argument

ProactRL, the reinforcement-learning policy that decides both when and what to retrieve by comparing paired branches from the same prefix and using the outcome difference as step-level supervision.

If this is right

  • Agents reach higher success rates on SciWorld and AlfWorld while issuing far fewer retrieval requests than passive baselines.
  • The same framework produces results competitive with proprietary models on the StuLife benchmark.
  • Memory and policy continue to improve together because retrieval decisions feed back into both the experience base and the main behavior.
  • Retrieval overhead drops because the policy learns to skip retrieval on steps where past experience adds no value.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The paired-branch technique could be applied to decide other costly internal actions, such as calling external tools or planning subgoals.
  • If the experience base grows very large, the same reward signal might be used to prune low-value entries rather than only to select among them.
  • Environments with noisy or conflicting memories would require an additional consistency check before the retrieval reward is computed.

Load-bearing premise

Comparing continuations from identical prefixes with and without retrieval gives an unbiased signal about whether retrieval is helpful at that exact step.

What would settle it

Run the paired-branch comparison on a held-out set of steps; if the branch that receives retrieval shows no consistent gain in final success or efficiency over the branch that skips retrieval, the supervision signal for the policy is invalid.

Figures

Figures reproduced from arXiv: 2604.20572 by Bo Zhang, Jie Zhou, Liang He, Qin Chen, Wei Li, Xin Li, Yuxuan Cai.

Figure 1
Figure 1. Figure 1: Comparison of retrieval strategies for online lifelong agents. Static initialization provides memory once at task [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of PROACTAGENT. (a) Experience-Enhanced Online Evolution (EXPONEVO) closes the loop between acting, experience accumulation, and policy optimization. (b) EXPERIENCE BASE partitions experience into five typed stores (Mf , Me , S +, S −, S ∆), so a single query returns complementary evidence and behavioral guidance. (c) Proactive Reinforcement Learning-based Retrieval (PROACTRL) replays the shared p… view at source ↗
Figure 3
Figure 3. Figure 3: Inference efficiency and training dynamics on SciWorld. Left: PROACTAGENT achieves higher success rates with fewer interaction rounds and lower token consumption than all baselines, where bubble area indicates average prompt tokens per episode. Right: PROACTAGENT consistently outperforms GRPO throughout training, converging to a substantially higher final accuracy. Experience ablation. As shown in table 3,… view at source ↗
Figure 4
Figure 4. Figure 4: Case studies across SciWorld, ALFWorld, and StuLife. Each panel contrasts a query branch (green) against a matched no-query branch (red) from the same interaction prefix or task instance. In all three cases, a single targeted retrieval at the action-critical decision point leads to immediate success, while the no-query branch drifts into invalid actions, wrong-object selection, or stalled interaction and f… view at source ↗
read the original abstract

Online lifelong learning agents must decide not only how to act but also when to consult prior experience to continually improve on long-horizon tasks. Existing methods typically retrieve memories passively, such as at task initialization or after each step, and therefore miss knowledge gaps that arise during interaction. We propose ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured Experience Base. ProactAgent continually improves through ExpOnEvo, which jointly updates policies and refines memory, organizing past interactions into factual, episodic, and skill repositories. It further introduces ProactRL, which treats retrieval as an explicit policy action and learns when and what to retrieve. By comparing paired continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level process rewards that encourage retrieval only when it improves task outcomes or efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently outperforms all baselines, achieving up to 32% relative improvement in success rate and over 33% reduction in interaction rounds. Our code will be publicly available at GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce ProactAgent, a framework for experience-driven lifelong agents that performs proactive retrieval from a structured base (factual memory, episodic memory, behavioral skills) rather than passive triggering. It proposes Experience-Enhanced Online Evolution (ExpOnEvo) for joint policy and memory refinement, and Proactive RL-based Retrieval (ProactRL) that treats retrieval as a policy action trained via paired-branch process rewards: continuations from identical interaction prefixes are compared with and without retrieval to supply step-level supervision that encourages retrieval only when it improves outcomes or efficiency. Experiments on SciWorld, AlfWorld, and StuLife report success rates of 73.50% and 71.28% on the first two environments, reduced retrieval overhead, and performance competitive with proprietary models on the third.

Significance. If the results hold after addressing the supervision-signal concerns, the work would offer a concrete mechanism for reducing unnecessary retrieval while improving long-horizon performance, which is a practical advance for memory-augmented agents. The multi-environment evaluation and explicit comparison to proprietary models provide useful empirical grounding; the structured experience base and online evolution component also supply reusable design patterns.

major comments (2)
  1. [ProactRL / §3] ProactRL description (abstract and §3): the paired-branch comparison that supplies process rewards assumes the without-retrieval continuation is an unbiased counterfactual. The manuscript does not detail prefix selection criteria (e.g., uncertainty thresholds), whether the two branches use identical temperature/stochasticity, or how cached states are avoided. This risks selection bias or reward hacking and directly affects the central claim that the policy learns to 'ask only when needed.'
  2. [Experiments] Experimental section (results on SciWorld/AlfWorld): success-rate gains are reported without accompanying statistical tests, variance across seeds, or ablation isolating the contribution of ProactRL versus ExpOnEvo alone. Given that the training signal depends on downstream outcomes, these omissions make it difficult to assess whether the reported 73.50% and 71.28% figures are robust or partly attributable to post-hoc tuning.
minor comments (2)
  1. [Abstract] The abstract states 'substantially reducing retrieval overhead' but does not quantify the reduction (e.g., average retrievals per episode or percentage decrease); adding a concrete metric would strengthen the efficiency claim.
  2. [Experience base] Notation for the three memory types (factual, episodic, behavioral skills) is introduced without a compact table or diagram showing their retrieval interfaces; a small summary table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications on our design and commitments to strengthen the manuscript.

read point-by-point responses
  1. Referee: [ProactRL / §3] ProactRL description (abstract and §3): the paired-branch comparison that supplies process rewards assumes the without-retrieval continuation is an unbiased counterfactual. The manuscript does not detail prefix selection criteria (e.g., uncertainty thresholds), whether the two branches use identical temperature/stochasticity, or how cached states are avoided. This risks selection bias or reward hacking and directly affects the central claim that the policy learns to 'ask only when needed.'

    Authors: We appreciate the referee's careful reading of the ProactRL mechanism. The paired-branch process is designed to provide direct step-level supervision by comparing outcomes from identical prefixes. To address potential bias, prefix selection is performed based on the agent's internal uncertainty estimate at each step, both branches are run with matching stochasticity settings (same temperature and seed), and the without-retrieval branch is executed in a reset environment state to prevent any carry-over from caching. These measures aim to make the counterfactual as unbiased as possible. We will revise §3 to explicitly document these implementation choices, including the exact criteria and procedures used, to eliminate ambiguity around selection bias and reward hacking. revision: yes

  2. Referee: [Experiments] Experimental section (results on SciWorld/AlfWorld): success-rate gains are reported without accompanying statistical tests, variance across seeds, or ablation isolating the contribution of ProactRL versus ExpOnEvo alone. Given that the training signal depends on downstream outcomes, these omissions make it difficult to assess whether the reported 73.50% and 71.28% figures are robust or partly attributable to post-hoc tuning.

    Authors: We acknowledge that the current experimental presentation lacks statistical tests, seed variance, and clear ablations, which limits the assessment of robustness. In the revised version, we will include standard deviations from multiple random seeds and conduct appropriate statistical significance tests (e.g., t-tests) for the reported success rates. Additionally, we will expand the experimental section with dedicated ablations that isolate the effect of ProactRL from ExpOnEvo by comparing the full ProactAgent against a baseline using only ExpOnEvo with passive retrieval. These ablations demonstrate the specific contribution of the proactive retrieval policy. While the reward signal is derived from downstream task outcomes, the paired-branch comparison provides granular, step-wise supervision that reduces reliance on post-hoc adjustments. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external task outcomes and benchmark experiments

full rationale

The paper's core claims rest on introducing ExpOnEvo for memory refinement and ProactRL for learning a retrieval policy via paired-branch comparisons that assign rewards from downstream task success rates and efficiency on SciWorld, AlfWorld, and StuLife. These are not self-definitional, as the supervision signal derives from independent environment outcomes rather than re-using fitted parameters or prior self-citations as the sole justification. No equations or sections reduce the reported success rates (73.50% on SciWorld, 71.28% on AlfWorld) to inputs by construction; the method is falsifiable against external benchmarks and does not invoke uniqueness theorems or ansatzes from overlapping prior work. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review performed on abstract only; ledger is therefore incomplete. The framework assumes a well-structured experience base and that paired-branch comparisons yield reliable supervision signals.

axioms (2)
  • domain assumption Retrieval decisions can be supervised by comparing task outcomes from identical prefixes with and without retrieval
    Core of ProactRL training signal; appears in abstract description of paired-branch process rewards.
  • domain assumption Organizing memory into factual, episodic, and skill repositories enables both evidence and actionable guidance
    Stated as the basis for the experience base design.
invented entities (2)
  • ProactRL no independent evidence
    purpose: Models retrieval as an explicit policy action learned via paired-branch rewards
    New component introduced to enable proactive decisions; no independent evidence outside the paper.
  • ExpOnEvo no independent evidence
    purpose: Enables continual improvement through policy updates and memory refinement
    Framework component for experience-driven evolution; no external validation provided.

pith-pipeline@v0.9.0 · 5568 in / 1373 out tokens · 42218 ms · 2026-05-10T00:05:19.666653+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.