Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

Bo Pang; Semih Yavuz; Shafiq Joty; Srijan Bansal; Yang Li; Ye Liu; Yifei Ming; Zeyu Leo Liu; Zixuan Ke

arxiv: 2607.01480 · v1 · pith:BJ3Q4N7Vnew · submitted 2026-07-01 · 💻 cs.AI · cs.LG

Pith reviewed 2026-07-03 20:10 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords procedural memory distillation · self-improving language models · reinforcement learning with verifiable rewards · online reflection · policy distillation · co-evolution training · self-distillation

The pith

Language models improve by turning cross-episode rollout patterns into reusable memory that supervises and updates the policy itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard reinforcement learning with verifiable rewards (RLVR) and self-distillation methods update policies using only episode-level signals, leaving richer procedural information in trajectories unused across repeated encounters with related problems. Procedural Memory Distillation (PMD) extracts three levels of structure from the model's own rollouts (raw trajectories, self-reflected strategies and lessons, and recurring behavioral patterns) and distills them into the policy weights through a memory-conditioned self-teacher. This creates a co-evolution dynamic in which the policy generates data that refines the memory, and the memory in turn shapes the supervision that refines the policy. The process yields a memory-free model at inference time and, per the abstract, improves over the SDPO self-distillation baseline by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH. A reader would care because the method shows how self-improvement can capture and internalize cross-episode regularities that single-episode updates miss.

Core claim

Procedural Memory Distillation converts cross-episode signals from model rollouts into reusable procedural memory organized at three levels of abstraction—raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns—and distills this memory into the policy via a memory-conditioned self-teacher that supervises the student on its own rollouts, enabling the policy to progressively internalize the procedural knowledge and producing a memory-free model at inference.
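
To make "a memory-conditioned self-teacher supervises the student on its own rollouts" more concrete, here is a minimal sketch of one way such supervision could be scored: a per-token KL divergence between the teacher's next-token distribution (same model, prompt augmented with retrieved memory) and the student's (no memory), averaged over a sequence the student sampled. The abstract does not state the loss, so the KL direction, the function names, and the toy logits are all assumptions for illustration, not the paper's objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL along one rollout sampled from the student."""
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student must score the same positions")
    total = sum(token_kl(t, s) for t, s in zip(teacher_seq, student_seq))
    return total / len(student_seq)

# Toy logits over a 3-token vocabulary at 2 positions (hypothetical values).
# At position 0 the memory-conditioned teacher prefers token 0; the student is uniform.
student = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
teacher = [[2.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

Because the teacher differs from the student only through the memory in its context, driving this quantity toward zero is one plausible reading of "internalizing" the memory: the student learns to produce, without memory, what it would have produced with it.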

What carries the argument

The co-evolution loop in which the policy generates rollouts that update the procedural memory and the memory then supplies supervision that updates the policy.
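
The shape of that loop can be sketched with a deliberately tiny stand-in. Everything below is a toy assumption: the "policy" is a lookup table, the "verifier" is an equality check, and `CANDIDATES`, `reflect`, `consolidate`, and `teacher_answer` are hypothetical names, not the paper's Algorithm 1. The only point is the alternation: rollouts update a three-level memory, and the memory then determines the supervision that updates the policy.

```python
from dataclasses import dataclass, field

CANDIDATES = ["a", "b", "c", "d"]  # toy answer space (assumption)

@dataclass
class ProceduralMemory:
    experiences: dict = field(default_factory=dict)  # level 1: pid -> [(answer, passed)]
    insights: dict = field(default_factory=dict)     # level 2: pid -> [("use" | "avoid", answer)]
    behaviors: set = field(default_factory=set)      # level 3: answers failing on 2+ problems

def reflect(answer, passed):
    """Turn one verified attempt into a per-problem lesson."""
    return ("use" if passed else "avoid", answer)

def consolidate(insights):
    """Find failure patterns that recur across different problems."""
    failed_on = {}
    for pid, notes in insights.items():
        for kind, answer in notes:
            if kind == "avoid":
                failed_on.setdefault(answer, set()).add(pid)
    return {a for a, pids in failed_on.items() if len(pids) >= 2}

def teacher_answer(pid, memory):
    """Memory-conditioned teacher: reuse a verified answer, else avoid known failures."""
    notes = memory.insights.get(pid, [])
    for kind, answer in notes:
        if kind == "use":
            return answer
    avoided = {a for kind, a in notes if kind == "avoid"}
    options = [c for c in CANDIDATES if c not in avoided]
    options.sort(key=lambda c: c in memory.behaviors)  # recurring failures go last
    return options[0] if options else CANDIDATES[0]

def co_evolve(gold, epochs):
    policy = {pid: CANDIDATES[0] for pid in gold}
    memory = ProceduralMemory()
    for _ in range(epochs):
        for pid in gold:
            answer = policy[pid]             # policy produces a rollout
            passed = answer == gold[pid]     # verifier scores it
            memory.experiences.setdefault(pid, []).append((answer, passed))
            memory.insights.setdefault(pid, []).append(reflect(answer, passed))
        memory.behaviors = consolidate(memory.insights)               # memory update
        policy = {pid: teacher_answer(pid, memory) for pid in gold}   # policy update
    return policy, memory

policy, memory = co_evolve({"p1": "c", "p2": "d"}, epochs=4)
print(policy)  # the final policy is a plain lookup; it no longer consults memory
```

Freezing either side breaks the toy in the way the abstract describes for the real system: with a frozen policy the memory only ever records the same first attempt, and with a frozen memory the teacher has nothing new to pass on.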

If this is right

Across Qwen3-8B and OLMo3-Instruct-7B, the method improves over SDPO by 3.8-5.5 percent on SCIKNOWEVAL and 7.9-13.6 percent on LIVECODEBENCH.
Freezing either the memory or the policy during training leaves performance more than 10 percent below full PMD across SCIKNOWEVAL domains.
The final trained policy operates without external memory at inference time.
Co-evolution between policy and memory is required for the observed improvements rather than either component alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The three-level memory structure might allow selective transfer of only the higher-level patterns to smaller models without retraining the full memory.
The online extraction process could be applied to other verifiable-reward settings such as theorem proving or tool-use sequences where recurring failure modes appear across episodes.
If the self-teacher supervision proves robust, it may reduce reliance on curated offline datasets for continued self-improvement after initial training.

Load-bearing premise

The self-reflected strategies, lessons, and patterns extracted online from the model's trajectories are accurate and non-noisy enough to serve as effective supervision signals.

What would settle it

An ablation that replaces the extracted memory contents with random or inverted reflections and measures whether performance on SCIKNOWEVAL and LIVECODEBENCH still exceeds the SDPO baseline by the reported margins.
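
A minimal sketch of how the two control memories for such an ablation could be built, assuming reflections are stored per problem as `(kind, text)` pairs with `kind` in `{"use", "avoid"}`. That storage format and the function names are assumptions made here for illustration; the paper's memory format is not given in the abstract.

```python
import random

def shuffled_control(insights, seed=0):
    """Give every problem the reflections of a different problem.

    Shuffling the ids and then rotating by one guarantees no problem keeps
    its own reflections, as long as there are at least two problems.
    """
    ids = sorted(insights)
    if len(ids) < 2:
        raise ValueError("need at least two problems to mismatch reflections")
    rng = random.Random(seed)
    rng.shuffle(ids)
    n = len(ids)
    return {ids[i]: list(insights[ids[(i + 1) % n]]) for i in range(n)}

def inverted_control(insights):
    """Flip the polarity of every reflection while keeping its text."""
    flip = {"use": "avoid", "avoid": "use"}
    return {
        pid: [(flip[kind], text) for kind, text in notes]
        for pid, notes in insights.items()
    }

# Hypothetical memory contents, invented for this example.
memory = {
    "p1": [("use", "balance the equation before computing moles")],
    "p2": [("avoid", "off-by-one in the loop bound")],
    "p3": [("use", "check units at each step")],
}
shuffled = shuffled_control(memory, seed=0)
inverted = inverted_control(memory)
```

The shuffled control keeps the amount and style of memory text constant while breaking its relevance, and the inverted control keeps relevance while reversing the advice, so the two together would separate "any extra context helps" from "the reflections are actually correct".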

Figures

Figures reproduced from arXiv: 2607.01480 by Bo Pang, Semih Yavuz, Shafiq Joty, Srijan Bansal, Yang Li, Ye Liu, Yifei Ming, Zeyu Leo Liu, Zixuan Ke.

**Figure 1.** Overview of Procedural Memory Distillation (PMD). (1) The student makes repeated attempts and receives verifier feedback. (2) Self-reflection summarizes successes and failures into online memory. (3) The teacher retrieves relevant memory in the form of experience, insight, and behaviors to provide memory-conditioned supervision. (4) This guidance is distilled into the updated student for the next epoch. Th…

**Figure 2.** Memory transfer on SCIKNOWEVAL. Memories are learned from Qwen3-8B under both PMD (co-evolving policy) and frozen-policy settings, then transferred across model scales. Top: PMD vs. frozen-policy memory transfer (shaded bands: cross-domain variability). Bottom: PMD retrieved memories vs. performance (K ∈ {1, 3, 5}); dotted black denotes no-memory. We evaluate cross-scale transfer of learned memory (insight+…

**Figure 3.** PMD preserves answer-space coverage that SDPO collapses on SCIKNOWEVAL. Using 16 rollouts/problem, lines show maj@k and best@k as rollout budget k increases. The shaded band (maj@k → best@k) is verifier headroom. PMD’s band is 2–4× wider than SDPO across all subjects, indicating greater retained candidate diversity. [Panel text extracted after the caption: Biology, N=50: PMD 76%, SDPO 66%, neither 20%; Chemistry, N=210: neither 7%, …]

**Figure 4.** Per-subject Venn diagrams of problems with at least one of 16 rollouts correct (…

**Figure 5.** Surface-level probe of procedural-memory internalization. We collect rollouts from …

**Figure 6.** Memory dynamics across SCIKNOWEVAL subjects. We track experience memory, insight memory, and behavior memory during PMD training. Per-problem memories tend to accumulate and saturate, while the global behavior bank shows subject-dependent consolidation dynamics …
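
For readers unfamiliar with the quantities in the Figure 3 caption, the sketch below shows one common way to compute maj@k, best@k, and the "verifier headroom" between them. It assumes each rollout yields a single final answer, that the first k rollouts are used at budget k, and that majority ties go to the answer seen first; the paper may define these details differently. The rollout data is invented.

```python
from collections import Counter

def best_at_k(answers, gold, k):
    """True if any of the first k rollouts is correct (what a perfect verifier could pick)."""
    return any(a == gold for a in answers[:k])

def maj_at_k(answers, gold, k):
    """True if the most frequent answer among the first k rollouts is correct.

    Counter.most_common breaks ties by first occurrence.
    """
    top, _ = Counter(answers[:k]).most_common(1)[0]
    return top == gold

def coverage_curves(rollouts, gold, ks):
    """Per budget k: (maj@k, best@k, headroom), each averaged over problems."""
    n = len(rollouts)
    curves = {}
    for k in ks:
        maj = sum(maj_at_k(rollouts[p], gold[p], k) for p in rollouts) / n
        best = sum(best_at_k(rollouts[p], gold[p], k) for p in rollouts) / n
        curves[k] = (maj, best, best - maj)
    return curves

# Invented toy data: 3 problems, 4 rollouts each.
rollouts = {
    "q1": ["A", "B", "A", "A"],
    "q2": ["C", "C", "D", "C"],   # the correct answer D appears once: best@4 hits, maj@4 misses
    "q3": ["B", "A", "B", "B"],
}
gold = {"q1": "A", "q2": "D", "q3": "B"}
curves = coverage_curves(rollouts, gold, ks=[1, 4])
for k, (maj, best, headroom) in curves.items():
    print(f"k={k}: maj@k={maj:.2f} best@k={best:.2f} headroom={headroom:.2f}")
```

A wide band means correct answers are still being sampled even when they do not win the vote, which is the sense in which the caption reads band width as retained candidate diversity.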

Original abstract

Reinforcement learning with verifiable rewards (RLVR), along with recent selfdistillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local updates cannot capture: which strategies consistently pass verification, which failure modes persist, which patterns recur. We propose Procedural Memory Distillation (PMD), which converts these crossepisode signals into reusable procedural memory and distills it into the policy's weights during training. This memory functions as a training scaffold, absorbed into the policy itself, yielding a memory-free model at inference. PMD organizes the memory at three levels of abstraction: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns that recur across problems, all extracted online from the model's own trajectories. A memory-conditioned self-teacher draws on the accumulated experience to supervise the student on its own rollouts, enabling student to progressively internalize procedural knowledge within its parameters. The central design principle is co-evolution: the policy generates rollouts that update the memory, and memory shapes the supervision that updates the policy. Empirically, across Qwen3-8B and OLMo3-Instruct-7B, PMD improves over SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH. Co-evolution powers these gains: freezing either the memory or the policy trails PMD by more than 10% across SCIKNOWEVAL domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PMD adds three-level procedural memory and online co-evolution to SDPO-style training, with reported gains on coding and knowledge benchmarks, but the abstract leaves reflection accuracy and implementation details unaddressed.

The core idea is that standard RLVR and SDPO updates stay too local to single episodes, missing recurring strategies and failure modes across related problems. PMD tries to fix that by extracting raw trajectories, self-reflected strategies, and higher-level patterns into reusable memory, then using a memory-conditioned self-teacher to supervise the student policy so the knowledge gets baked into the weights. The co-evolution loop (the policy updates the memory, and the memory shapes the next round of supervision) is presented as the mechanism that drives the reported 3.8-5.5% lift on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH over SDPO, with ablations in which freezing either component trails full PMD by more than 10% across SCIKNOWEVAL domains.

What stands out is the explicit three-level organization and the claim that the final model runs memory-free at inference. That matches a practical need in self-improvement setups for reasoning and code tasks. The cross-episode signal point is also a fair observation that most current methods ignore.

The main limitation visible from the abstract is the lack of any description of how reflections are generated, filtered, or verified. If the model’s own summaries of its trajectories contain errors, the co-evolution loop would just amplify them, and the freezing ablations would not catch that. No equations, no quality-control steps, and no statistical details on the gains are supplied here, so the data-to-claim link stays thin. The stress-test concern about noisy self-reflections therefore lands directly on the central assumption.

This is aimed at groups already running RLVR or SDPO pipelines on 7-8B models and looking for ways to retain procedural knowledge. It is worth sending to peer review so the implementation and reflection reliability can be checked properly.

Referee Report

3 major / 2 minor

Summary. The paper proposes Procedural Memory Distillation (PMD) to address the limitation of episode-local updates in RLVR and SDPO by extracting cross-episode procedural memory at three levels (raw trajectories, self-reflected strategies/lessons, higher-level behavioral patterns) from the model's own rollouts. A memory-conditioned self-teacher then supervises the student policy in a co-evolution loop where the policy updates the memory and the memory shapes policy updates, with the goal of internalizing the knowledge into model weights for a memory-free inference model. Empirically, PMD outperforms SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH across Qwen3-8B and OLMo3-Instruct-7B, with ablations showing >10% drops when freezing either memory or policy.

Significance. If the results and the accuracy of the self-reflections hold, the work could meaningfully advance online self-improvement methods for LLMs by converting transient rollout signals into reusable, internalized procedural knowledge. The explicit co-evolution principle and multi-level memory organization provide a concrete design pattern that, if validated with full implementation details, would be a useful contribution to the self-distillation literature.

major comments (3)

[Abstract] Abstract and methods description: no mechanism is described for verifying the accuracy of self-reflected strategies/lessons or filtering noise from model-generated reflections before they are used as supervision signals. This is load-bearing for the central claim, as inaccurate reflections would be distilled and amplified through the co-evolution loop, directly undermining the reported gains over SDPO.
[Empirical results] Empirical results: the manuscript supplies no statistical tests, run-to-run variance, or detailed ablation breakdowns (beyond the high-level freezing experiment) to support the 3.8-5.5% and 7.9-13.6% improvements or to isolate the contribution of the three-level memory organization.
[Abstract] Abstract: the co-evolution loop is described only at a high level with no equations, update rules, or pseudocode for memory extraction, memory-conditioned supervision, or the interaction between policy and memory. This prevents assessment of whether performance gains are independent of the extraction process or partly circular.
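
As an illustration of the kind of run-to-run evidence the second major comment asks for, the sketch below runs an exact paired sign-flip (permutation) test on per-seed scores using only the standard library. The scores are made-up placeholders, not results from the paper, and `paired_sign_flip_p` is a hypothetical helper name.

```python
from itertools import product
from statistics import mean, stdev

def paired_sign_flip_p(scores_a, scores_b):
    """Two-sided exact p-value for the mean paired difference.

    Under the null hypothesis each per-seed difference is equally likely to
    have either sign, so all 2^n sign patterns are enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    patterns = list(product((1, -1), repeat=len(diffs)))
    hits = 0
    for signs in patterns:
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / len(patterns)

# Placeholder accuracies for five seeds (invented for this example).
pmd_runs = [0.62, 0.64, 0.61, 0.65, 0.63]
sdpo_runs = [0.58, 0.59, 0.60, 0.60, 0.57]

diffs = [a - b for a, b in zip(pmd_runs, sdpo_runs)]
p_value = paired_sign_flip_p(pmd_runs, sdpo_runs)
print(f"mean diff = {mean(diffs):.3f} ± {stdev(diffs):.3f}, p = {p_value}")  # p = 0.0625
```

The example also shows a limit worth keeping in mind: with only five paired runs, even a method that wins on every seed cannot get below p = 2/32 = 0.0625 on this two-sided test, so five seeds support a mean and standard deviation but not a conventional p < 0.05 claim from a sign-based test.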

minor comments (2)

[Abstract] Abstract contains missing hyphens in compound terms ("selfdistillation", "crossepisode").
The paper would benefit from a diagram illustrating the three memory levels and the co-evolution flow between policy and memory.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The three major comments identify important gaps in methodological detail, empirical rigor, and clarity of the co-evolution process. We address each point below and will revise the manuscript to incorporate the requested clarifications and additions.

Referee: [Abstract] Abstract and methods description: no mechanism is described for verifying the accuracy of self-reflected strategies/lessons or filtering noise from model-generated reflections before they are used as supervision signals. This is load-bearing for the central claim, as inaccurate reflections would be distilled and amplified through the co-evolution loop, directly undermining the reported gains over SDPO.

Authors: We agree that the absence of an explicit verification or filtering mechanism for self-reflections is a substantive omission. The current manuscript relies on the fact that only trajectories passing the external verifier are used to seed reflections, but does not describe additional consistency or cross-episode validation steps. In the revision we will add a new subsection (Methods 3.3) that specifies (1) outcome-based filtering that retains only reflections associated with verified successes and (2) a lightweight consistency check that discards reflections whose implied strategy contradicts the verifier outcome on the same trajectory. These additions will be accompanied by an ablation measuring the effect of the filter. revision: yes
Referee: [Empirical results] Empirical results: the manuscript supplies no statistical tests, run-to-run variance, or detailed ablation breakdowns (beyond the high-level freezing experiment) to support the 3.8-5.5% and 7.9-13.6% improvements or to isolate the contribution of the three-level memory organization.

Authors: The referee is correct that the reported improvements lack statistical support and fine-grained ablations. We will revise the Experiments section to include: (a) results from five independent random seeds with mean and standard deviation, (b) paired t-tests or Wilcoxon tests with p-values for all headline comparisons against SDPO, and (c) an expanded ablation table that isolates the incremental contribution of each memory level (raw trajectories, self-reflected strategies, behavioral patterns) as well as the interaction between levels. These changes will be added before resubmission. revision: yes
Referee: [Abstract] Abstract: the co-evolution loop is described only at a high level with no equations, update rules, or pseudocode for memory extraction, memory-conditioned supervision, or the interaction between policy and memory. This prevents assessment of whether performance gains are independent of the extraction process or partly circular.

Authors: The full manuscript already contains pseudocode (Algorithm 1) and the memory-conditioned loss (Equation 4) in Section 3, but the abstract and high-level overview remain purely descriptive. We will revise the abstract to include a concise statement of the alternating update rule and will move the existing pseudocode and equations into a more prominent position in the main text with an explicit non-circularity argument: memory is updated only from verified rollouts, while the policy is trained on a mixture of memory-conditioned and standard RLVR objectives. This should make the independence of the two processes clearer. revision: partial
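
The two-step filter promised in the first response can be sketched as follows. The `Reflection` record and its `claims_success` field are hypothetical: how one would determine what a free-text reflection "implies" is exactly the unspecified part, and it is assumed here to be available as a boolean.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reflection:
    problem_id: str
    text: str
    claims_success: bool  # assumed label: does the reflection describe the attempt as working?
    verified: bool        # external verifier outcome for the same trajectory

def filter_reflections(reflections):
    """Keep reflections that pass both checks described in the rebuttal."""
    kept = []
    for r in reflections:
        if not r.verified:
            continue  # (1) outcome-based filter: verified successes only
        if r.claims_success != r.verified:
            continue  # (2) consistency check: reflection contradicts the verifier
        kept.append(r)
    return kept

# Invented examples.
candidates = [
    Reflection("p1", "Substituting early simplified the integral.", True, True),
    Reflection("p2", "The greedy choice worked here.", True, False),
    Reflection("p3", "This approach was a dead end.", False, True),
]
kept = filter_reflections(candidates)
print([r.problem_id for r in kept])  # ['p1']
```

Written out this way, rule (1) as worded discards every reflection on a failed attempt, so rule (2) can only ever remove reflections that describe a verified success as a failure. Lessons drawn from failures, which the Figure 1 caption says are also summarized into memory, would need a separate path through the filter.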

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks and ablations

The paper proposes an empirical training procedure (PMD) that extracts procedural memory from model trajectories and uses a memory-conditioned self-teacher for supervision, with co-evolution as the design principle. Claims of improvement (3.8-5.5% on SCIKNOWEVAL, 7.9-13.6% on LIVECODEBENCH over SDPO) are supported by direct comparisons to baselines and ablations (freezing memory or policy drops performance >10%). No equations, parameter fits, or derivations are described that reduce the reported gains to the method inputs by construction. The evaluation uses held-out benchmarks and external verifiers, rendering results falsifiable outside the training loop itself. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core choices.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on domain assumptions about the value of extracted reflections and the feasibility of online memory construction; no explicit free parameters or invented entities with independent evidence are stated in the abstract.

axioms (2)

[domain assumption] The richer procedural information in the rollout is rarely retained or reused across episodes and epochs in RLVR and SDPO.
Explicitly stated as the core motivation in the abstract.
[ad hoc to paper] Self-reflected strategies and lessons extracted from the model's trajectories can be reliably used to supervise the student.
Central unverified premise required for the distillation step to succeed.

invented entities (1)

[no independent evidence] Procedural memory organized at three levels of abstraction (raw trajectories, self-reflected strategies/lessons, higher-level behavioral patterns).
purpose: To capture and distill cross-episode signals into reusable form for policy update.
New construct introduced by the paper; no independent evidence or falsifiable handle provided in the abstract.

Reference graph

Works this paper leans on

66 extracted references · 41 canonical work pages · 31 internal anchors

[1]

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sergio Ramos Garea, Matthieu Geist, and Olivier Bachem. “On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes”. In:International Conference on Learning Representations. 2024

2024
[2]

X-KD: General Experiential Knowledge Distillation for Large Language Models

Yuang Cai and Yuyu Yuan. “X-KD: General Experiential Knowledge Distillation for Large Language Models”. In:arXiv preprint arXiv:2602.12674(2026)

work page arXiv 2026
[3]

Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

Zihao Cheng, Zeming Liu, Yingyu Shan, Xinyi Wang, Xiangrong Zhu, Yunpu Ma, Hongru Wang, Yuhang Guo, Wei Lin, and Yunhong Wang. “Mem2Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation”. In:arXiv preprint arXiv:2604.10923(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, and Huajun Chen. “SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models”. In:arXiv preprint arXiv:2406.09098(2024)

work page arXiv 2024
[5]

MiniLLM: Knowledge Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. “MiniLLM: Knowledge Distillation of Large Language Models”. In:International Conference on Learning Representations. 2024

2024
[6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, and Xiao Bi. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”. In:arXiv preprint arXiv:2501.12948(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, and Sanjeev Arora. “Self-Distillation Zero: Self- Revision Turns Binary Rewards into Dense Supervision”. In:arXiv preprint arXiv:2604.12002 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network”. In:NeurIPS Deep Learning and Representation Learning Workshop. 2015

2015
[9]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. “R-Zero: Self-Evolving Reasoning LLM from Zero Data”. In:arXiv preprint arXiv:2508.05004(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. “Reinforcement learning via self-distillation”. In:arXiv preprint arXiv:2601.20802 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”. In:arXiv preprint arXiv:2403.07974(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. “Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?” In:arXiv preprint arXiv:2603.24472(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

uttler, Mike Lewis, Wen-tau Yih, Tim Rockt

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K"uttler, Mike Lewis, Wen-tau Yih, Tim Rockt"aschel, Sebastian Riedel, and Douwe Kiela. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In: Advances in Neural Information Processing Systems. 2020

2020
[14]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. “Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe”. In:arXiv preprint arXiv:2604.13016(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Thinking Machines Lab: Con- nectionism

Kevin Lu and Thinking Machines Lab.On-Policy Distillation. Thinking Machines Lab: Con- nectionism. 2025.URL: https://thinkingmachines.ai/blog/on-policy- distillation/

2025
[16]

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. “SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization”. In:arXiv preprint arXiv:2604.02268(2026). 10

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. “Self-Refine: Iterative Refinement with Self-Feedback”. In:Advances in Neural Information Processing Systems. 2023

2023
[18]

Olmo 3

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heine- man, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, and Hamish Ivison. “Olmo 3”. In: arXiv preprint arXiv:2512.13961(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, and Xiangru Tang. “Reasoningbank: Scaling agent self-evolving with reasoning memory”. In:arXiv preprint arXiv:2509.25140(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonza- lez. “MemGPT: Towards LLMs as Operating Systems”. In:arXiv preprint arXiv:2310.08560 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Generative Agents: Interactive Simulacra of Human Behavior

Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. “Generative Agents: Interactive Simulacra of Human Behavior”. In: ACM Symposium on User Interface Software and Technology. 2023

2023
[22]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”. In:Advances in Neural Information Processing Systems. 2023

2023
[23]

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”. In:International Conference on Artificial Intelligence and Statistics. 2011

2011
[24]

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. “On-policy self-distillation for reasoning compression”. In:arXiv preprint arXiv:2603.05433(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy Optimization Algorithms”. In:arXiv preprint arXiv:1707.06347(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models”. In:arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Self-Distillation Enables Continual Learning

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. “Self-distillation enables continual learning”. In:arXiv preprint arXiv:2601.19897(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Experiential reinforcement learning

Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, and Jieyu Zhao. “Experien- tial reinforcement learning”. In:arXiv preprint arXiv:2602.13949(2026)

work page arXiv 2026
[29]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. “Reflexion: Language Agents with Verbal Reinforcement Learning”. In:Advances in Neural Information Processing Systems. 2023

2023
[30]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. “A Survey of On-Policy Distillation for Large Language Models”. In:arXiv preprint arXiv:2604.00626(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Saksham Sahai Srivastava and Haoyu He. “MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval”. In:arXiv preprint arXiv:2512.16962(2025)

work page arXiv 2025
[32]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. “V oyager: An Open-Ended Embodied Agent with Large Language Models”. In:arXiv preprint arXiv:2305.16291(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, and Honggang Qi. “Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents”. In:arXiv preprint arXiv:2604.10674(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Skil- lOrchestra: Learning to route agents via skill transfer

Jiayu Wang, Yifei Ming, Zixuan Ke, Shafiq Joty, Aws Albarghouthi, and Frederic Sala. “Skil- lOrchestra: Learning to route agents via skill transfer”. In:arXiv preprint arXiv:2602.19672 (2026)

work page arXiv 2026
[35]

Mem-{\alpha}: Learning Memory Construction via Reinforcement Learning

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. “Mem-α: Learning memory construction via reinforcement learning”. In:arXiv preprint arXiv:2509.25911(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Q-learning

Christopher JCH Watkins and Peter Dayan. “Q-learning”. In:Machine learning8.3 (1992), pp. 279–292. 11

1992
[37]

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Rui Wu, Yifei Li, Yuchen Zhang, Yiming Wang, Xiaodong Li, Lichang Chen, Jinyang Chen, Lei Li, and Xipeng Qiu. “Self-Evolving LLM Agents through an Experience-Driven Lifecycle”. In:arXiv preprint arXiv:2510.16079(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

TokMem: One-token procedural memory for large language models

Zijun Wu, Yongchang Hao, and Lili Mou. “TokMem: One-token procedural memory for large language models”. In:arXiv preprint arXiv:2510.00444(2025)

work page arXiv 2025
[39]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. “SkillRL: Evolving agents via recursive skill-augmented reinforcement learning”. In:arXiv preprint arXiv:2602.08234(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Meta-Reinforcement Learning with Self-Reflection for Agentic Search

Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, and Hannaneh Hajishirzi. “Meta-Reinforcement Learning with Self-Reflection for Agentic Search”. In:arXiv preprint arXiv:2603.11327(2026)

[41]

A-mem: Agentic memory for LLM agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. “A-mem: Agentic memory for LLM agents”. In: Advances in Neural Information Processing Systems 38 (2026), pp. 17577–17604

[42]

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. “Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning”. In: arXiv preprint arXiv:2508.19828 (2025)

[43]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, and Chenxu Lv. “Qwen3 technical report”. In: arXiv preprint arXiv:2505.09388 (2025)

[44]

Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. “Self-Distilled RLVR”. In: arXiv preprint arXiv:2604.03128 (2026)

[45]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. “ReAct: Synergizing Reasoning and Acting in Language Models”. In: International Conference on Learning Representations. 2023

[46]

Online Experiential Learning for Language Models

Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. “Online experiential learning for language models”. In: arXiv preprint arXiv:2603.16856 (2026)

[47]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. “On-policy context distillation for language models”. In: arXiv preprint arXiv:2602.12275 (2026)

[48]

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. “Agentic memory: Learning unified long-term and short-term memory management for large language model agents”. In: arXiv preprint arXiv:2601.01885 (2026)

[49]

MemGen: Weaving generative latent memory for self-evolving agents

Guibin Zhang, Muxin Fu, and Shuicheng Yan. “MemGen: Weaving generative latent memory for self-evolving agents”. In: arXiv preprint arXiv:2509.24704 (2025)

[50]

Embarrassingly Simple Self-Distillation Improves Code Generation

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. “Embarrassingly Simple Self-Distillation Improves Code Generation”. In: arXiv preprint arXiv:2604.01193 (2026)

[51]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, and Junyang Lin. “Qwen3 embedding: Advancing text embedding and reranking through foundation models”. In: arXiv preprint arXiv:2506.05176 (2025)

[52]

MemFly: On-the-Fly Memory Optimization via Information Bottleneck

Zhenyuan Zhang, Xianzhang Jia, Zhiqin Yang, Zhenbo Song, Wei Xue, Sirui Han, and Yike Guo. “MemFly: On-the-Fly Memory Optimization via Information Bottleneck”. In: arXiv preprint arXiv:2602.07885 (2026)

[53]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. “Self-distilled reasoner: On-policy self-distillation for large language models”. In: arXiv preprint arXiv:2601.18734 (2026)

[54]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. “Memorybank: Enhancing large language models with long-term memory”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 17. 2024, pp. 19724–19731

[55]

Memento: Fine-tuning LLM agents without fine-tuning LLMs

Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. “Memento: Fine-tuning LLM agents without fine-tuning LLMs”. In: arXiv preprint arXiv:2508.16153 (2025).

Table 4: Generalization of different procedural-memory levels on SCIKNOWEVAL. The frozen-policy setting tests whether ...

1. {behavior_1_name}: {behavior_1_instruction}
2. {behavior_2_name}: {behavior_2_instruction}
...
K. {behavior_K_name}: {behavior_K_instruction}

Correct solution: {successful_previous_attempt}

The following is feedback from your unsuccessful earlier attempt: {feedback_raw}

Correctly solve the original question.

Here, {problem_text} is the original question. The fields {strategies} and {lessons} are probl...

1. Identify recurring patterns across these related problems
2. Extract 3–8 reusable behaviors, each with a name and instruction
3. Focus on high-level, transferable guidance
4. Include both positive behaviors and mistakes to avoid

Avoid problem-specific wording, option-letter shortcuts, and references to individual attempts. Respond ONLY with valid JSON: {"behaviors": [{"name": "behavior_...", "instruction": "..."}, ...]}

Evolution prompt. Once the bank contains existing behaviors, the extractor switches to an evolution prompt. It is shown the current semantic-cluster summaries and ...

1. Review existing behaviors against the new cluster evidence
2. Identify gaps not covered by existing behaviors
3. Decide on actions: new, update, or remove

Keep behaviors reusable, concise, and independent of any single problem. Respond ONLY with valid JSON: {"actions": [ {"action": "new", "name": "behavior_...", "instruction": "..."}, {"action": "update", "name": "behavior_existing_name", "instruction": "..."}, {"action": "remove", "name": "behavior_bad_one"}]}

This retrieve-then-decide pattern prevents the...
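
The evolution prompt asks for a JSON reply listing `new`, `update`, and `remove` actions. A minimal Python sketch of how such a reply might be applied to a behavior bank is given here. The function name `apply_actions`, the dict-based bank, the example behaviors, and the policy of skipping a `new` action whose name already exists or an `update` to an unknown name are all assumptions made for illustration, not details taken from the paper.

```python
import json


def apply_actions(bank, response_text):
    """Apply 'new' / 'update' / 'remove' actions to a behavior bank.

    bank: dict mapping behavior name -> instruction (hypothetical layout).
    Returns a new dict; the input bank is left unchanged.
    """
    payload = json.loads(response_text)
    updated = dict(bank)
    for act in payload.get("actions", []):
        if not isinstance(act, dict):
            continue  # ignore malformed entries
        kind = act.get("action")
        name = act.get("name")
        instruction = act.get("instruction")
        if not name:
            continue
        if kind == "new" and name not in updated and instruction:
            updated[name] = instruction
        elif kind == "update" and name in updated and instruction:
            updated[name] = instruction
        elif kind == "remove":
            updated.pop(name, None)
    return updated


# Hypothetical bank and reply, following the JSON shape in the prompt.
bank = {
    "behavior_check_units": "Verify units before answering.",
    "behavior_bad_one": "Guess the most common option.",
}
response = json.dumps({"actions": [
    {"action": "new", "name": "behavior_restate_goal",
     "instruction": "Restate what is asked before solving."},
    {"action": "update", "name": "behavior_check_units",
     "instruction": "Verify units and magnitudes before answering."},
    {"action": "remove", "name": "behavior_bad_one"},
]})
new_bank = apply_actions(bank, response)
```

Returning a copy rather than mutating the bank in place keeps the previous bank available for comparison or rollback; whether the actual system does this is not stated in the source.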
