Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
Pith reviewed 2026-05-10 00:05 UTC · model grok-4.3
The pith
Lifelong agents learn an explicit policy for retrieving past experience only when it improves the next decision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProactAgent organizes past interactions into factual memory, episodic memory, and behavioral skills, then trains a retrieval policy through Proactive Reinforcement Learning-based Retrieval (ProactRL). ProactRL compares two continuations that start from the identical state: one branch receives retrieved content and the other does not. The difference in eventual task outcome or efficiency supplies the reward that updates the retrieval decision. Combined with Experience-Enhanced Online Evolution that updates both the main policy and the memory store, the framework yields success rates of 73.50 percent on SciWorld and 71.28 percent on AlfWorld while cutting retrieval calls.
What carries the argument
ProactRL, the reinforcement-learning policy that decides both when and what to retrieve by comparing paired branches from the same prefix and using the outcome difference as step-level supervision.
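A minimal sketch of the paired-branch mechanism may make this concrete. The Python below replays the same prefix twice under matched randomness, once with retrieved content and once without, and scores the retrieval decision by the outcome difference; `env`, `policy`, and the efficiency weighting are assumed interfaces for illustration, not the paper's actual API.

```python
import copy
import random

def paired_branch_reward(env, policy, prefix_actions, retrieved_context,
                         seed=0, temperature=0.7):
    """Sketch of a paired-branch process reward (assumed interfaces)."""
    def rollout(use_retrieval):
        random.seed(seed)                 # match stochasticity across branches
        branch_env = copy.deepcopy(env)   # fresh copy: no cached-state carry-over
        state = branch_env.replay(prefix_actions)
        context = retrieved_context if use_retrieval else None
        done, success, steps = False, False, 0
        while not done and steps < branch_env.max_steps:
            action = policy.act(state, context=context, temperature=temperature)
            state, done, success = branch_env.step(action)
            steps += 1
        return float(success), steps

    score_with, steps_with = rollout(use_retrieval=True)
    score_without, steps_without = rollout(use_retrieval=False)

    # Positive when retrieval helped; a small efficiency bonus breaks ties
    # in favor of the branch that finished in fewer steps.
    outcome_diff = score_with - score_without
    tie_bonus = 0.01 * (steps_without - steps_with) if outcome_diff == 0 else 0.0
    return outcome_diff + tie_bonus
```

A reward of zero tells the policy that retrieval added nothing at this step, which is exactly the signal it needs to learn to skip retrieval there.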
If this is right
- Agents reach higher success rates on SciWorld and AlfWorld while issuing far fewer retrieval requests than passive baselines.
- The same framework produces results competitive with proprietary models on the StuLife benchmark.
- Memory and policy continue to improve together because retrieval decisions feed back into both the experience base and the main behavior.
- Retrieval overhead drops because the policy learns to skip retrieval on steps where past experience adds no value.
Where Pith is reading between the lines
- The paired-branch technique could be applied to decide other costly internal actions, such as calling external tools or planning subgoals.
- If the experience base grows very large, the same reward signal might be used to prune low-value entries rather than only to select among them.
- Environments with noisy or conflicting memories would require an additional consistency check before the retrieval reward is computed.
Load-bearing premise
Comparing continuations from identical prefixes with and without retrieval gives an unbiased signal about whether retrieval is helpful at that exact step.
What would settle it
Run the paired-branch comparison on a held-out set of steps; if the branch that receives retrieval shows no consistent gain in final success or efficiency over the branch that skips retrieval, the supervision signal for the policy is invalid.
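A concrete version of this falsification test is sketched below: gather paired final outcomes on held-out steps and check whether the with-retrieval branch shows a consistent positive gain. The pairing format, names, and the t-statistic threshold are illustrative assumptions, not the paper's protocol.

```python
import math
import statistics

def retrieval_gain_check(paired_outcomes, t_threshold=2.0):
    """paired_outcomes: list of (with_retrieval, without_retrieval) final
    scores from identical held-out prefixes (assumed non-empty)."""
    diffs = [w - wo for w, wo in paired_outcomes]
    mean_gain = statistics.mean(diffs)
    sd = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    # Paired t statistic: mean difference over its standard error.
    t_stat = mean_gain / (sd / math.sqrt(len(diffs))) if sd > 0 else float("inf")
    return mean_gain > 0 and abs(t_stat) > t_threshold
```

If this check fails on held-out steps, the paired-branch rewards that trained the policy were noise, and the premise above collapses.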
Figures
Original abstract
Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce ProactAgent, a framework for experience-driven lifelong agents that performs proactive retrieval from a structured base (factual memory, episodic memory, behavioral skills) rather than passive triggering. It proposes Experience-Enhanced Online Evolution (ExpOnEvo) for joint policy and memory refinement, and Proactive RL-based Retrieval (ProactRL) that treats retrieval as a policy action trained via paired-branch process rewards: continuations from identical interaction prefixes are compared with and without retrieval to supply step-level supervision that encourages retrieval only when it improves outcomes or efficiency. Experiments on SciWorld, AlfWorld, and StuLife report success rates of 73.50% and 71.28% on the first two environments, reduced retrieval overhead, and performance competitive with proprietary models on the third.
Significance. If the results hold after addressing the supervision-signal concerns, the work would offer a concrete mechanism for reducing unnecessary retrieval while improving long-horizon performance, which is a practical advance for memory-augmented agents. The multi-environment evaluation and explicit comparison to proprietary models provide useful empirical grounding; the structured experience base and online evolution component also supply reusable design patterns.
major comments (2)
- [ProactRL / §3] ProactRL description (abstract and §3): the paired-branch comparison that supplies process rewards assumes the without-retrieval continuation is an unbiased counterfactual. The manuscript does not detail prefix selection criteria (e.g., uncertainty thresholds), whether the two branches use identical temperature/stochasticity, or how cached states are avoided. This risks selection bias or reward hacking and directly affects the central claim that the policy learns to 'ask only when needed.'
- [Experiments] Experimental section (results on SciWorld/AlfWorld): success-rate gains are reported without accompanying statistical tests, variance across seeds, or ablation isolating the contribution of ProactRL versus ExpOnEvo alone. Given that the training signal depends on downstream outcomes, these omissions make it difficult to assess whether the reported 73.50% and 71.28% figures are robust or partly attributable to post-hoc tuning.
minor comments (2)
- [Abstract] The abstract states 'substantially reducing retrieval overhead' but does not quantify the reduction (e.g., average retrievals per episode or percentage decrease); adding a concrete metric would strengthen the efficiency claim.
- [Experience base] Notation for the three memory types (factual, episodic, behavioral skills) is introduced without a compact table or diagram showing their retrieval interfaces; a small summary table would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications on our design and commitments to strengthen the manuscript.
Point-by-point responses
-
Referee: [ProactRL / §3] ProactRL description (abstract and §3): the paired-branch comparison that supplies process rewards assumes the without-retrieval continuation is an unbiased counterfactual. The manuscript does not detail prefix selection criteria (e.g., uncertainty thresholds), whether the two branches use identical temperature/stochasticity, or how cached states are avoided. This risks selection bias or reward hacking and directly affects the central claim that the policy learns to 'ask only when needed.'
Authors: We appreciate the referee's careful reading of the ProactRL mechanism. The paired-branch process is designed to provide direct step-level supervision by comparing outcomes from identical prefixes. To address potential bias, prefix selection is performed based on the agent's internal uncertainty estimate at each step, both branches are run with matching stochasticity settings (same temperature and seed), and the without-retrieval branch is executed in a reset environment state to prevent any carry-over from caching. These measures aim to make the counterfactual as unbiased as possible. We will revise §3 to explicitly document these implementation choices, including the exact criteria and procedures used, to eliminate ambiguity around selection bias and reward hacking. revision: yes
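The uncertainty-gated prefix selection the authors describe could be as simple as the sketch below, which uses the entropy of the policy's action distribution as the internal uncertainty estimate; both that choice and the threshold are assumptions, since the manuscript does not yet document the criterion.

```python
import math

def should_trigger_retrieval(action_probs, entropy_threshold=1.0):
    """Gate a retrieval-trigger step on policy uncertainty (illustrative)."""
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    return entropy > entropy_threshold
```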
-
Referee: [Experiments] Experimental section (results on SciWorld/AlfWorld): success-rate gains are reported without accompanying statistical tests, variance across seeds, or ablation isolating the contribution of ProactRL versus ExpOnEvo alone. Given that the training signal depends on downstream outcomes, these omissions make it difficult to assess whether the reported 73.50% and 71.28% figures are robust or partly attributable to post-hoc tuning.
Authors: We acknowledge that the current experimental presentation lacks statistical tests, seed variance, and clear ablations, which limits the assessment of robustness. In the revised version, we will include standard deviations from multiple random seeds and conduct appropriate statistical significance tests (e.g., t-tests) for the reported success rates. Additionally, we will expand the experimental section with dedicated ablations that isolate the effect of ProactRL from ExpOnEvo by comparing the full ProactAgent against a baseline using only ExpOnEvo with passive retrieval. These ablations will demonstrate the specific contribution of the proactive retrieval policy. While the reward signal is derived from downstream task outcomes, the paired-branch comparison provides granular, step-wise supervision that reduces reliance on post-hoc adjustments. revision: yes
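The committed robustness reporting could follow a sketch like this one: per-seed success rates for the full agent versus an ExpOnEvo-only ablation, summarized with means, standard deviations, and a Welch t-test. The function and data names are placeholders for the planned revision, not results from the paper.

```python
from statistics import mean, stdev
from scipy import stats  # assumed available for the significance test

def compare_methods(full_agent_rates, ablated_rates):
    """Per-seed success rates in, summary statistics and Welch t-test out."""
    t_stat, p_value = stats.ttest_ind(full_agent_rates, ablated_rates,
                                      equal_var=False)
    return {
        "full_mean": mean(full_agent_rates), "full_std": stdev(full_agent_rates),
        "ablated_mean": mean(ablated_rates), "ablated_std": stdev(ablated_rates),
        "t": t_stat, "p": p_value,
    }
```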
Circularity Check
No circularity: derivation relies on external task outcomes and benchmark experiments
full rationale
The paper's core claims rest on introducing ExpOnEvo for memory refinement and ProactRL for learning a retrieval policy via paired-branch comparisons that assign rewards from downstream task success rates and efficiency on SciWorld, AlfWorld, and StuLife. These are not self-definitional, as the supervision signal derives from independent environment outcomes rather than re-using fitted parameters or prior self-citations as the sole justification. No equations or sections reduce the reported success rates (73.50% on SciWorld, 71.28% on AlfWorld) to inputs by construction; the method is falsifiable against external benchmarks and does not invoke uniqueness theorems or ansatzes from overlapping prior work. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Retrieval decisions can be supervised by comparing task outcomes from identical prefixes with and without retrieval
- domain assumption: Organizing memory into factual, episodic, and skill repositories enables both evidence and actionable guidance
invented entities (2)
- ProactRL: no independent evidence
- ExpOnEvo: no independent evidence
Reference graph
Works this paper leans on
-
[1]
ALFWorld: Aligning text and embodied environments for interactive learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021
2021
-
[2]
ScienceWorld: Is your agent smarter than a 5th grader?
Ruoyao Wang, Peter Jansen, Marc-Alexandre Cote, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
2022
-
[3]
Voyager: An open-ended embodied agent with large language models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. In Advances in Neural Information Processing Systems, 2023
2023
-
[4]
Building self-evolving agents via experience-driven lifelong learning: A framework and benchmark
Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, and Liang He. Building self-evolving agents via experience-driven lifelong learning: A framework and benchmark. arXiv preprint arXiv:2508.19005, 2025
2025
-
[5]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022
2022
-
[6]
Tree of thoughts: Deliberate problem solving with large language models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, 2023
2023
-
[7]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, 2023
2023
-
[8]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023
2023
-
[9]
Self-refine: Iterative refinement with self-feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, 2023
2023
-
[10]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023
2023
-
[11]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023
2023
-
[12]
ExpeL: LLM agents are experiential learners
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19666–19674, 2024
2024
-
[13]
MemoryBank: Enhancing large language models with long-term memory
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19724–19731, 2024
2024
-
[14]
MemEvolve: Meta-evolution of agent memory systems
Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025
2025
-
[15]
MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries
Yixuan Tang and Yi Yang. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024
2024
-
[16]
ReflectiveRAG: Rethinking adaptivity in retrieval-augmented generation
Akshay Verma, Swapnil Gupta, Siddharth Pillai, Prateek Sircar, and Deepak Gupta. ReflectiveRAG: Rethinking adaptivity in retrieval-augmented generation. 2026
2026
-
[17]
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026
2026
-
[18]
Neural Turing machines
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014
2014
-
[19]
Memory networks
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In International Conference on Learning Representations, 2015
2015
-
[20]
Hybrid computing using a neural network with dynamic external memory
Alex Graves, Greg Wayne, Malcolm Reynolds, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016
2016
-
[21]
Retrieval-augmented generation for knowledge-intensive NLP tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktaschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020
2020
-
[22]
Improving language models by retrieving from trillions of tokens
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, 2022
2022
-
[23]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023
2023
-
[24]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025
2025
-
[25]
Active retrieval augmented generation
Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
2023
-
[26]
Self-RAG: Learning to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations, 2024
2024
-
[27]
Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024
2024
-
[28]
Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023
2023
-
[29]
AgentGym: Evaluating and training large language model-based agents across diverse environments
Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Xin Guo, Dingwen Yang, Chenyang Liao, Wei He, et al. AgentGym: Evaluating and training large language model-based agents across diverse environments. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27914–27..., 2025
2025
-
[30]
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025
2025
-
[31]
AgentGym-RL: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning
Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, et al. AgentGym-RL: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning. arXiv preprint arXiv:2509.08755, 2025
2025
-
[32]
DeepSeekMath: Pushing the limits of mathematical reasoning in open language models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
2024
-
[33]
The extractor produces at most two entries of each type per trajectory, focusing on environment facts and trajectory-specific plans or constraints
Factual and episodic memories are extracted from individual trajectories via summarization. The extractor produces at most two entries of each type per trajectory, focusing on environment facts and trajectory-specific plans or constraints
-
[34]
Each distiller returns one to three JSON-formatted entries that encode reusable strategies (from successes) or corrective rules (from failures)
Success and failure skills are distilled from outcome-specific trajectory subsets. Each distiller returns one to three JSON-formatted entries that encode reusable strategies (from successes) or corrective rules (from failures)
-
[35]
Paired A/B branches produced by ProactRL are prioritized because they share the same task prefix and therefore expose the most localized contrastive signal
Comparative skills are distilled from matched trajectory pairs. Paired A/B branches produced by ProactRL are prioritized because they share the same task prefix and therefore expose the most localized contrastive signal. When such pairs are unavailable, the extractor falls back to outcome-ranked trajectory pairs from the same task group. The complete promp...
-
[36]
This stage is critical for initializing the policy with sufficient tool-calling competence before reinforcement learning begins (as confirmed by the ablation in Section 4.3)
Cold start. The base policy is trained via supervised learning on successful trajectories to learn the interaction format, valid action syntax, and retrieval-tag conventions. This stage is critical for initializing the policy with sufficient tool-calling competence before reinforcement learning begins (as confirmed by the ablation in Section 4.3)
-
[37]
A portion of these rollouts is configured as no-retrieval trajectories through the retrieval_enabled switch, whose probability is annealed across training phases
Rollout sampling. Multiple rollouts are sampled for each training prompt under the current policy. A portion of these rollouts is configured as no-retrieval trajectories through the retrieval_enabled switch, whose probability is annealed across training phases
-
[38]
Paired-branch construction. When paired branching is active, the system identifies retrieval-trigger steps in retrieval-enabled rollouts, replays the corresponding prefixes, and creates matched no-retrieval branches (Section D.2)
-
[39]
Reward computation. The environment outcome is combined with the paired-branch process reward and the efficiency bonus to produce the ProactRL trajectory-level reward (Section 3.3)
-
[40]
Policy update. The policy is updated using GRPO-style group normalization with PPO-style clipped surrogate optimization
-
[41]
Experience base update. The experience base D is updated by extracting factual, episodic, success, failure, and comparative entries from the new trajectories (Appendix C.3). This organization ensures that policy learning and memory growth remain tightly interleaved throughout training, realizing the co-evolution loop described in Section 3.2. A condensed sketch of this loop appears after the excerpt list below
-
[42]
* *Examples:* Info/Preferences, Domain Knowledge, Tool/System Facts
**Factual Memory (Objective Truths):** * *Definition:* Verifiable facts learned during execution. * *Examples:* Info/Preferences, Domain Knowledge, Tool/System Facts
-
[43]
If I take step A, error B occurs,
**Episodic Memory (Experience, Reflection & Temporal Events):** * *Definition:* Insights derived from the flow of events, strategies, errors, OR **specific real-world time constraints/schedules**. * *Logic Examples:* "If I take step A, error B occurs," "Method X is faster than Y." * *Temporal Examples:* "User has a class at 8 AM on Mondays," "The deadlin...
-
[44]
**Selection:** Extract a MAXIMUM of **2 Factual** and **2 Episodic** memories
-
[45]
when_to_use
**"when_to_use" Strategy:** * For *Factual*: Focus on the **Context Trigger** (e.g., "When using ‘pandas‘..."). * For *Episodic (Logic)*: Focus on the **Situational Trigger** (e.g., "When the dataset is empty..."). * For *Episodic (Time)*: Focus on the **Temporal Trigger** (e.g., "When it is Monday morning," "At 8:00 AM")
-
[46]
factual_memories
**Minimum Output:** Do not output an empty result. You must find the most valuable takeaway, even if minor. # Output Format Output a single valid JSON object strictly following this schema. {{ "factual_memories": [ {{ "when_to_use": "<Precise context trigger>", "memory": "<The objective fact>" }} ], "episodic_memories": [ {{ "when_to_use": "<Precise situa...
-
[47]
when_to_use
**FIELD: "when_to_use" (The Trigger Scope)** - **Definition:** Precisely define the context where this best practice applies. You MUST consider three dimensions: a. **Task Requirement:** What is the user specifically asking for? (e.g., "When the task requires verifying code execution results...") b. **Specific Scenario:** What is the current state of the ...
-
[48]
experience
**FIELD: "experience" (The Solution)** - **Definition:** A strict, actionable standard operating procedure. - **Constraint:** **PURELY FORWARD-LOOKING.** Do NOT explain why the approach was superior. Do NOT include phrases like "The agent succeeded because..." or "It is better to...". - **Structure:** Directly provide the step-by-step instruction on how t...
-
[49]
when_to_use
**FIELD: "when_to_use" (The Trigger Scope)** - **Definition:** Precisely define the context where this memory applies. You MUST consider three dimensions: a. **Task Requirement:** What is the user specifically asking for? (e.g., "When the task requires distinct counting of similar objects...") b. **Specific Scenario:** What is the current state of the age...
-
[50]
experience
**FIELD: "experience" (The Solution)** - **Definition:** A strict, actionable instruction on how to handle this exact situation correctly. - **Constraint:** **PURELY FORWARD-LOOKING.** Do NOT explain why the previous attempt failed. Do NOT include diagnosis or "The agent failed because..." statements. - **Content:** Directly provide the optimized logic or...
discussion (0)