pith. machine review for the scientific record.

arxiv: 2605.12484 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Devvrit Khatri, Inderjit S Dhillon, Joseph E. Gonzalez, Kurt Keutzer, Kusha Sareen, Lakshya A Agrawal, Matei Zaharia, Rishabh Agarwal, Rishabh Tiwari

Pith reviewed 2026-05-13 04:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords fast-slow learning · continual learning · LLM adaptation · in-context learning · catastrophic forgetting · plasticity · sample efficiency · reinforcement learning

The pith

Optimizing context as fast weights lets LLMs adapt to new tasks with up to 3x better sample efficiency and far less forgetting than parameter updates alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models adapt to tasks either by updating parameters through reinforcement learning or by changing context without updates, yet each approach has clear limits on performance or stability. The paper introduces a fast-slow framework that treats model parameters as stable slow weights and optimized context as fast weights that absorb task details from textual feedback. This hybrid reaches higher final performance than slow learning only, uses fewer samples, and keeps the model closer to its base state. If the claim holds, LLMs could handle sequences of changing tasks without the usual loss of prior abilities or the need for full retraining. Readers care because the method directly targets the tradeoff between quick adaptation and long-term retention in deployed systems.

Core claim

Fast-Slow Training (FST) keeps model parameters as slow weights for general reasoning while using optimized context as fast weights to capture task-specific information from textual feedback. Across reasoning tasks, FST proves up to 3x more sample-efficient than pure parameter-based RL and reaches a higher performance level. The approach produces up to 70% less KL divergence from the base model, which reduces catastrophic forgetting and maintains plasticity so that models adapt more readily to later tasks. In scenarios where task domains shift during training, FST continues to improve on each new task while RL-only training stalls.

What carries the argument

The fast-slow weights split, in which context is optimized as fast weights to absorb task-specific textual feedback while parameters serve as slow weights that preserve general behaviors.
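
The mechanics of that split can be illustrated with a toy sketch (all names here are hypothetical, not the paper's implementation; the actual procedure interleaves GEPA-style reflective context optimization with RL parameter updates):

```python
# Toy fast-slow loop: slow weights see only the scalar reward,
# fast context absorbs the rich textual feedback. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class SlowWeights:
    # Stand-in for model parameters theta; updated only from scalar reward.
    bias: float = 0.0
    def update(self, reward: float, lr: float = 0.1) -> None:
        self.bias += lr * reward  # slow, reward-driven change

@dataclass
class FastContext:
    # Stand-in for the optimized context pool Phi; absorbs textual feedback.
    notes: list = field(default_factory=list)
    def reflect(self, feedback: str) -> None:
        self.notes.append(feedback)  # fast, text-driven update

def fst_step(slow: SlowWeights, fast: FastContext, reward: float, feedback: str):
    fast.reflect(feedback)   # fast loop: consumes the rollout's full text
    slow.update(reward)      # slow loop: consumes the scalar reward only
    return slow, fast

slow, fast = SlowWeights(), FastContext()
for r, fb in [(1.0, "tool call failed: missing import"), (0.0, "timeout")]:
    fst_step(slow, fast, r, fb)
print(round(slow.bias, 2), len(fast.notes))  # → 0.1 2
```

The point of the sketch is the asymmetry of the two channels: the parameter update never sees the error messages, so task-specific detail has to live in the context.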

If this is right

  • FST reaches a higher performance asymptote than slow learning alone across the tested reasoning tasks.
  • Models trained with FST show up to 70% lower KL divergence from the base LLM.
  • Lower parameter drift produces measurably less catastrophic forgetting than RL training.
  • Preserved plasticity allows FST models to adapt more effectively to a second task after the first.
  • When task domains change during training, FST keeps acquiring new tasks while RL-only training stops improving.
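
The KL and forgetting bullets rest on a drift measurement of the form KL(πtrain ∥ πbase) averaged over token positions, as plotted in Figure 4. A minimal numerical sketch, with invented next-token distributions:

```python
# Sketch of a token-level drift measurement: mean KL(p_train || p_base)
# over positions in a rollout. Distributions here are invented toy values;
# the paper's exact estimator is not specified in this summary.
import math

def kl(p, q):
    # KL divergence between two next-token distributions (lists of probs).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_token_kl(train_dists, base_dists):
    # Average per-position KL over a rollout.
    return sum(kl(p, q) for p, q in zip(train_dists, base_dists)) / len(train_dists)

base = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]      # base policy per position
drifted = [[0.4, 0.4, 0.2], [0.7, 0.1, 0.2]]   # trained policy per position
print(round(mean_token_kl(drifted, base), 4))  # → 0.0322
```

Under this metric, "up to 70% less KL divergence" means the FST policy's averaged per-token KL from the base model is up to 70% smaller than the RL-only policy's at matched reward.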

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of fast context updates from slow parameter changes could reduce the frequency of expensive full-model retraining in production LLM systems.
  • This style of hybrid learning may extend beyond RL to other update methods such as supervised fine-tuning when feedback remains textual.
  • The preserved plasticity suggests the framework could support agents that encounter lifelong sequences of tasks with minimal cumulative degradation.
  • Testing the same fast-slow split on non-textual feedback or multimodal inputs would reveal whether the efficiency pattern generalizes beyond the current reasoning benchmarks.

Load-bearing premise

That context optimization as fast weights can reliably take in task-specific information from feedback and generalize across tasks without creating new interference or instabilities that cancel out the efficiency and stability gains.

What would settle it

Running FST and RL-only training on a fresh set of reasoning tasks with varied feedback formats and finding no consistent 3x sample-efficiency advantage or no reduction in KL divergence or forgetting would show the central benefits do not hold.

Figures

Figures reproduced from arXiv: 2605.12484 by Devvrit Khatri, Inderjit S Dhillon, Joseph E. Gonzalez, Kurt Keutzer, Kusha Sareen, Lakshya A Agrawal, Matei Zaharia, Rishabh Agarwal, Rishabh Tiwari.

Figure 1
Figure 1. Figure 1: FST jointly optimizes slow parameters θ and a fast textual-context pool Φ via interleaved fast and slow update loops. The slow loop (top) updates θ from the scalar reward alone (θc → θc+1). The fast loop (bottom) updates Φ via reflective optimization, additionally consuming the rollout’s full text including thoughts, tool calls, errors, and rich feedback (Φc → Φc+1). Maintaining Φ as a Pareto-frontier popu… view at source ↗
Figure 2
Figure 2. Figure 2: Data efficiency across three training families. Top row: matched-step validation reward (running max, mean@4) — FST reaches RL’s running peak in substantially fewer training steps (3.0× on CodeIO, 1.4× on Math (Polaris), 3.0× on HoVer-hard). Bottom row: 6/8-axis coverage radars for Base→GEPA, RL→GEPA, and FST→GEPA on Mean@8 and Best@8, with axes grouped by in-distribution (sage), cross-domain (coral), and … view at source ↗
Figure 3
Figure 3. Figure 3: Performance asymptote on CodeIO, Math (Polaris), and HoVer-hard. For each run we fit a 4-parameter sigmoid R − R0 = (A − R0) / (1 + (C_mid/C)^B) to the validation-accuracy trajectory and annotate the upper asymptote A. FST’s asymptote (green) is higher than RL’s (blue) on all three tasks. Solid curves cover the fit window; dotted curves are extrapolation past the last training step. we compute token-level KL from the… view at source ↗
Figure 4
Figure 4. Figure 4: Validation reward versus KL(πtrain ∥ πbase) trajectories on CodeIO, HoVer, and Physics. Translucent markers are per-checkpoint measurements; the line is a rolling-mean smoothing along training step. At matched reward, FST (green) sits to the left of RL (blue) on every task, reaching the same reward at a significantly lower KL from the base policy. methods. FST reaches near-peak in every stage while learning… view at source ↗
Figure 5
Figure 5. Figure 5: Plasticity probe: starting from a Math (left) or Physics (right) checkpoint trained with either RL or FST, we run a fresh RL pass on HoVer-hard and plot HoVer validation accuracy over 400 steps. Base init (dotted) is the no-prior-training reference. FST-init (green) preserves more capacity for the new task than RL-init (blue) on both arms; on the Math arm, prior RL collapses HoVer-hard learnability to near… view at source ↗
Figure 6
Figure 6. Figure 6: Continual learning across HoVer → CodeIO → Physics: a single uninterrupted training run that switches task every 200 steps. The y-axis is per-task validation accuracy normalized with respect to the peak accuracy reached across methods within each stage. FST (solid) reaches near-peak on every stage; RL (dashed) acquires HoVer but completely stalls on CodeIO and only partially recovers on Physics. represents… view at source ↗
Figure 7
Figure 7. Figure 7: Star Graph Search Task. FST escapes the zero-reward regime by step ∼50, an order of magnitude before RL begins to move signal at ∼300. (Panels plot validation reward (%) and actor entropy (nats) against training step for RL, FST, FST-distill, and GEPA.) [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: HoVer training: FST (green) lifts validation accuracy above the prompt-only ceiling (GEPA only, dashed), RL (blue) plateaus, and FST-distill, fast-weight self-distillation that relies on GEPA to drive reward gains. Fast and slow weights: complementary learning systems. The fast/slow decomposition predates deep learning, with roots in the neuroscience of complementary learning systems [25, 34] and a long li… view at source ↗
Figure 9
Figure 9. Figure 9: CodeIO design ablations (val mean@4 at training step 500). Sage bars mark the headline configuration in each panel; gray bars are the alternatives swept; the dashed line indicates RL-only at the same matched step. (a) Population size K. (b) Advantage baseline at K=2: per-prompt (Prompt baseline) vs. per-problem (Problem baseline). (c) Cycle length T at K=8, Problem baseline. (d) Light vs. full GEPA recipe… view at source ↗
Figure 10
Figure 10. Figure 10: Rollout reuse on HoVer-hard, training step [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Validation reward versus KL(πtrain ∥ πbase) on all four training tasks: CodeIO, Math (Polaris), HoVer, and Physics. Same axes, smoothing, and conventions as [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
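
The asymptote comparison in Figure 3 rests on a four-parameter sigmoid fit, R(C) = R0 + (A − R0)/(1 + (C_mid/C)^B). A sketch of the fitted form (parameter values here are invented, not taken from the paper):

```python
# Four-parameter sigmoid from the Figure 3 caption: R0 is the starting
# reward, A the upper asymptote, C_mid the midpoint, B the steepness.
# Values below are illustrative only.

def sigmoid_reward(C, R0, A, C_mid, B):
    return R0 + (A - R0) / (1 + (C_mid / C) ** B)

# At C = C_mid the curve sits halfway between R0 and A; as C grows,
# R(C) approaches the annotated asymptote A.
print(round(sigmoid_reward(1e9, R0=0.2, A=0.8, C_mid=100.0, B=2.0), 4))  # → 0.8
```

Comparing the fitted A across methods is what licenses the "higher performance asymptote" claim even when curves are extrapolated past the last training step.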
read the original abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Fast-Slow Training (FST) framework for LLMs in which model parameters act as slow weights updated via RL while optimized context serves as fast weights that absorb task-specific information from textual feedback. The central empirical claims are that FST is up to 3x more sample-efficient than parameter-only RL across reasoning tasks, reaches higher performance asymptotes, produces up to 70% lower KL divergence from the base model, reduces catastrophic forgetting, and preserves plasticity so that models adapt more effectively to subsequent tasks in continual-learning settings where domains shift on the fly.

Significance. If the reported efficiency, KL, and plasticity gains hold under fuller verification, the work would provide a concrete, empirically supported route to continual adaptation in LLMs that avoids the usual trade-off between task acquisition and retention of general capabilities. The combination of parameter and context optimization is a straightforward extension of existing ideas yet yields measurable improvements on the stated metrics; reproducible code or machine-checked elements are not mentioned, but the falsifiable predictions (sample-efficiency ratios, KL deltas, plasticity retention) are a strength.

major comments (2)
  1. [§4] §4 (Experimental Results): the claims of 3x sample efficiency, higher asymptotes, and 70% KL reduction are presented as direct comparisons to RL baselines, yet the section provides neither error bars, complete baseline hyper-parameter tables, nor explicit task-selection criteria; without these the quantitative support for the central efficiency and forgetting-mitigation claims remains only partially verifiable.
  2. [§3] §3 (Fast-Slow Training procedure): the optimization of context as fast weights is asserted to absorb task-specific information from textual feedback without introducing offsetting interference, but no ablation isolating the contribution of context optimization versus parameter updates is reported; this is load-bearing for the claim that FST simultaneously improves efficiency and plasticity.
minor comments (2)
  1. [Abstract] Abstract and §1: the phrases 'up to 3x more sample-efficient' and 'up to 70% less KL divergence' would benefit from immediate parenthetical references to the specific tasks or figures that attain those maxima.
  2. [§2] Notation: the distinction between 'slow weights' (parameters) and 'fast weights' (context) is clear, but a short table summarizing the update rules for each would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the rigor and clarity of our work. We address each major comment point by point below, agreeing where revisions are needed to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): the claims of 3x sample efficiency, higher asymptotes, and 70% KL reduction are presented as direct comparisons to RL baselines, yet the section provides neither error bars, complete baseline hyper-parameter tables, nor explicit task-selection criteria; without these the quantitative support for the central efficiency and forgetting-mitigation claims remains only partially verifiable.

    Authors: We acknowledge that the current presentation of results in §4 lacks error bars, full hyperparameter details, and explicit task selection criteria, which limits full verifiability. In the revised manuscript, we will add error bars computed over multiple random seeds for all key metrics, include a detailed table of hyperparameters for FST and all RL baselines, and clearly state the task selection criteria used in our experiments. These changes will provide stronger quantitative support for the reported improvements in sample efficiency, performance, and reduced forgetting. revision: yes

  2. Referee: [§3] §3 (Fast-Slow Training procedure): the optimization of context as fast weights is asserted to absorb task-specific information from textual feedback without introducing offsetting interference, but no ablation isolating the contribution of context optimization versus parameter updates is reported; this is load-bearing for the claim that FST simultaneously improves efficiency and plasticity.

    Authors: The absence of a dedicated ablation study is a valid point, as it would more directly isolate the role of context optimization. Although the main experiments compare FST to parameter-only RL and show benefits in efficiency and plasticity, we will incorporate an ablation analysis in the revised version. This will include variants where context optimization is removed or where only context is optimized, to demonstrate that the fast-slow combination yields synergistic improvements without interference. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines the Fast-Slow Training (FST) framework explicitly as a hybrid of slow parameter updates via RL and fast context optimization from textual feedback. All performance claims (sample efficiency, higher asymptote, reduced KL divergence, preserved plasticity) are presented strictly as empirical outcomes from direct comparisons to RL baselines on reasoning tasks. No equations, predictions, or uniqueness theorems are invoked that reduce by construction to fitted inputs, self-citations, or ansatzes from prior author work. The derivation chain consists of a straightforward procedural description followed by experimental results, with no load-bearing steps that loop back to the paper's own definitions or data fits.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on standard assumptions about in-context learning effectiveness and the separability of task-specific versus general knowledge in LLMs, with limited new free parameters or invented entities beyond the conceptual framing.

free parameters (1)
  • context optimization hyperparameters
    Parameters governing how textual feedback is used to optimize the fast context weights are required for the method but not specified in the abstract.
axioms (1)
  • domain assumption: Textual feedback can be used to optimize context to absorb task-specific information without parameter changes
    The core mechanism assumes in-context optimization can serve as an effective fast learning channel.
invented entities (1)
  • fast weights as optimized context (no independent evidence)
    purpose: To enable rapid task adaptation while keeping slow weights stable
    The paper introduces this framing of context optimization as fast weights, though the underlying technique of prompt/context tuning is not new.

pith-pipeline@v0.9.0 · 5626 in / 1424 out tokens · 68766 ms · 2026-05-13T04:55:30.038838+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 19 internal anchors

  1. [1]

    PromptWizard: Task-aware prompt optimization framework, 2024

    Eshaan Agarwal, Joykirat Singh, Vivek Dani, Raghav Magazine, Tanuja Ganu, and Akshay Nambi. PromptWizard: Task-aware prompt optimization framework, 2024. URL https://arxiv.org/abs/2405.18369.

  2. [2]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning, 2026. URL https://arxiv.org/abs/250...

  3. [3]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

    Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris.

  4. [4]

    Prediction and control in continual reinforcement learning, 2023

    Nishanth Anand and Doina Precup. Prediction and control in continual reinforcement learning, 2023. URL https://arxiv.org/abs/2312.11669.

  5. [5]

    Thinking fast and slow with deep learning and tree search

    Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1705.08439.

  6. [6]

    Using fast weights to attend to the recent past, 2016

    Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past, 2016. URL https://arxiv.org/abs/1610.06258.

  7. [7]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray,...

  8. [8]

    Trace is the next AutoDiff: Generative optimization with rich feedback, execution traces, and LLMs, 2024

    Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the next AutoDiff: Generative optimization with rich feedback, execution traces, and LLMs, 2024. URL https://arxiv.org/abs/2406.16218.

  9. [9]

    The entropy mechanism of reinforcement learning for reasoning language models,

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models,

  10. [10]

    URL https://arxiv.org/abs/2505.22617.

  11. [11]

    RLPrompt: Optimizing discrete text prompts with reinforcement learning,

    Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P. Xing, and Zhiting Hu. RLPrompt: Optimizing discrete text prompts with reinforcement learning,

  12. [12]

    URL https://arxiv.org/abs/2205.12548.

  13. [13]

    Loss of plasticity in deep continual learning

    Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024. ISSN 1476-4687. doi: 10.1038/s41586-024-07711-7. URL https://doi.org/10.1038/s41586-024-07711-7.

  14. [14]

    Nemotron-math: Efficient long-context distillation of mathematical reasoning from multi-mode supervision, 2025

    Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov, George Armstrong, Stephen Ge, Edgar Minasyan, Feng Chen, and Igor Gitman. Nemotron-math: Efficient long-context distillation of mathematical reasoning from multi-mode supervision. arXiv preprint arXiv:2512.15489, 2025.

  15. [15]

    Promptbreeder: Self-referential self-improvement via prompt evolution, 2023

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution, 2023. URL https://arxiv.org/abs/2309.16797.

  16. [16]

    Reset It and Forget It: Relearning Last-Layer Weights Improves Continual and Transfer Learning, 2024

    Lapo Frati, Neil Traft, Jeff Clune, and Nick Cheney. Reset It and Forget It: Relearning Last-Layer Weights Improves Continual and Transfer Learning. IOS Press, October 2024. ISBN 9781643685489. doi: 10.3233/faia240840. URL http://dx.doi.org/10.3233/FAIA240840.

  17. [17]

    Scaling laws for reward model overoptimization, 2022

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization, 2022. URL https://arxiv.org/abs/2210.10760.

  18. [18]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024. URL https://arxiv.org/abs/2401.14196.

  19. [19]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  20. [20]

    EvoPrompt: Connecting LLMs with evolutionary algorithms yields powerful prompt optimizers,

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. EvoPrompt: Connecting LLMs with evolutionary algorithms yields powerful prompt optimizers,

  21. [21]

    URL https://arxiv.org/abs/2309.08532.

  22. [22]

    G. E. Hinton and D. C. Plaut. Using fast weights to deblur old memories. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, pages 177–186, Hillsdale, NJ, 1987. Lawrence Erlbaum Associates.

  23. [23]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation, 2026. URL https://arxiv.org/abs/2601.20802.

  24. [24]

    Hover: A dataset for many-hop fact extraction and claim verification, 2020

    Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. Hover: A dataset for many-hop fact extraction and claim verification, 2020. URL https://arxiv.org/abs/2011.03088.

  25. [25]

    Scaling laws for forgetting when fine-tuning large language models, 2024

    Damjan Kalajdzievski. Scaling laws for forgetting when fine-tuning large language models, 2024. URL https://arxiv.org/abs/2401.05605.

  26. [26]

    The art of scaling reinforcement learning compute for LLMs, 2025

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms, 2025. URL https://arxiv.org/abs/2510.13786.

  27. [27]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URL https://arxiv.org/abs/2310.03714.

  28. [28]

    What learning systems do intelligent agents need? Complementary learning systems theory updated, 2016

    Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, 2016. doi: 10.1016/j.tics.2016.05.004.

  29. [29]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

  30. [30]

    Lanpo: Bootstrapping language and numerical feedback for reinforcement learning in LLMs, 2025

    Ang Li, Yifei Wang, Zhihang Yuan, Stefanie Jegelka, and Yisen Wang. Lanpo: Bootstrapping language and numerical feedback for reinforcement learning in LLMs, 2025. URL https://arxiv.org/abs/2510.16552.

  31. [31]

    Codei/o: Condensing reasoning patterns via code input-output prediction, 2025

    Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. Codei/o: Condensing reasoning patterns via code input-output prediction, 2025. URL https://arxiv.org/abs/2502.07316.

  32. [32]

    The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward, 2026

    Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward, 2026. URL https://arxiv.org/abs/2509.07430.

  33. [33]

    Mitigating the alignment tax of rlhf, 2024

    Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, and Tong Zhang. Mitigating the alignment tax of rlhf, 2024. URL https://arxiv.org/abs/2309.06256.

  34. [34]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747.

  35. [35]

    Understanding and preventing capacity loss in reinforcement learning, 2022

    Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement learning, 2022. URL https://arxiv.org/abs/2204.09560.

  36. [36]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651.

  37. [37]

    Why there are complementary learning systems in the hippocampus and neocortex, 1995

    James L. McClelland, Bruce L. McNaughton, and Randall C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419–457, 1995. doi: 10.1037/0033-295X.102.3.419.

  38. [38]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jia...

  39. [39]

    Optimizing instructions and demonstrations for multi-stage language model programs,

    Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs,

  40. [40]

    URL https://arxiv.org/abs/2406.11695.

  41. [41]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  42. [42]

    Quang Pham, Chenghao Liu, and Steven C. H. Hoi. Continual learning, fast and slow, 2022. URL https://arxiv.org/abs/2209.02370.

  43. [43]

    What can you do when you have zero rewards during rl?, 2025

    Jatin Prakash and Anirudh Buvanesh. What can you do when you have zero rewards during rl?, 2025. URL https://arxiv.org/abs/2510.03971.

  44. [44]

    GrIPS: Gradient-free, edit-based instruction search for prompting large language models, 2023

    Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. GrIPS: Gradient-free, edit-based instruction search for prompting large language models, 2023. URL https://arxiv.org/abs/2203.07281.

  45. [45]

    Automatic prompt optimization with “gradient descent” and beam search, 2023

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with “gradient descent” and beam search, 2023. URL https://arxiv.org/abs/2305.03495.

  46. [46]

    How to explore to scale RL training of LLMs on hard problems. CMU Machine Learning Blog, 2025

    Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. How to explore to scale RL training of LLMs on hard problems. CMU Machine Learning Blog, 2025. URL https://blog.ml.cmu.edu/2025/11/26/how-to-explore-to-scale-rl-training-of-llms-on-hard-problems/. Introduces POPE (Privileged On-Policy Exploration); paper in preparation.

  47. [47]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024. URL https://arxiv.org/abs/2305.18290. 3, 9

  48. [48]

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers, 2021. URL https://arxiv.org/abs/2102.11174. 10

  49. [49]

    J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992. doi: 10.1162/neco.1992.4.1.131. URL https://doi.org/10.1162/neco.1992.4.1.131. 2, 10

  50. [50]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347. 3, 9

  51. [51]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300. 1, 2, 3, 9, 19

  52. [52]

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259. 6, 9

  53. [53]

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts, 2020. URL https://arxiv.org/abs/2010.15980. 9

  54. [54]

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366. 9, 17

  55. [55]

    Alessandro Sordoni, Xingdi Yuan, Marc-Alexandre Côté, Matheus Pereira, and Adam Trischler. Joint prompt optimization of stacked LLMs using variational inference, 2023. URL https://arxiv.org/abs/2306.12509. 9

  56. [56]

    Dilara Soylu, Christopher Potts, and Omar Khattab. Fine-tuning and prompt optimization: Two great steps that work better together, 2024. URL https://arxiv.org/abs/2407.10930. EMNLP 2024. 10

  57. [57]

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards, 2025. URL https://arxiv.org/abs/2505.24760. 4

  58. [58]

    Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory, 2025. URL https://arxiv.org/abs/2504.07952. 9

  59. [59]

    Hongyao Tang, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, and Glen Berseth. Mitigating plasticity loss in continual reinforcement learning by reducing churn, 2025. URL https://arxiv.org/abs/2506.00592. 1, 5, 6, 9

  60. [60]

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory, 2024. URL https://arxiv.org/abs/2409.07429. 9

  61. [61]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903. 1

  62. [62]

    Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery, 2023. URL https://arxiv.org/abs/2302.03668. 9

  63. [63]

    Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Arnav Singhvi, Bowen Hong, Wenfei Liang, James Zou, Omar Khattab, Jure Leskovec, and Matei Zaharia. Optimas: Optimizing compound AI systems with globally aligned local rewards, 2025. URL https://arxiv.org/abs/2507.03041. 9

  64. [64]

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents, 2025. URL https://arxiv.org/abs/2502.12110. 9

  65. [65]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  66. [66]

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers, 2024. URL https://arxiv.org/abs/2309.03409. 2, 3, 9, 17

  67. [67]

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text, 2024. URL https://arxiv.org/abs/2406.07496. 9

  68. [68]

    Lunjun Zhang, Ryan Chen, and Bradly C. Stadie. Evolutionary system prompt learning for reinforcement learning in LLMs, 2026. URL https://arxiv.org/abs/2602.14697. 10

  69. [69]

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models, 2025. URL https://arxiv.org/abs/2510.04618. 9

  70. [70]

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers, 2023. URL https://arxiv.org/abs/2211.01910. 2, 3, 9

  71. [71]

    Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D’Oosterlinck, Christopher Potts, and Omar Khattab. mmGRPO: Composing policy gradients and prompt optimization for language model programs, 2025. URL https://arxiv.org/abs/2508.04660. 10

A GEPA

We optimize the fast we...

at learning rate 10⁻⁶ with a 10-step linear warm-up; we use no learning-rate decay. Each RL step samples G = 8 rollouts per problem with train_batch_size = 32 problems (so 256 rollouts per step), runs PPO updates with ppo_mini_batch_size = 32, and uses tensor-parallel size 1 for the rollout engine (vLLM). At evaluation time we report mean@4 over four rollouts per ...
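The rollout and minibatch arithmetic above can be sketched as follows. This is a minimal illustration, not the paper's code: the variable names mirror the quantities named in the text (G, train_batch_size, ppo_mini_batch_size), and mean_at_k is a hypothetical helper for the mean@4 metric.

```python
# Hedged sketch of the per-step RL arithmetic described in the appendix.
train_batch_size = 32      # problems sampled per RL step
G = 8                      # rollouts per problem
ppo_mini_batch_size = 32   # rollouts per PPO minibatch

# 32 problems x 8 rollouts = 256 rollouts per RL step
rollouts_per_step = train_batch_size * G

# 256 rollouts / 32 per minibatch = 8 PPO minibatches per step
ppo_minibatches = rollouts_per_step // ppo_mini_batch_size

def mean_at_k(scores):
    """mean@k: average score over k independent rollouts of one problem."""
    return sum(scores) / len(scores)

print(rollouts_per_step)                # 256
print(mean_at_k([1.0, 0.0, 1.0, 1.0]))  # 0.75
```

With these settings each evaluation problem contributes the average of its four rollout scores, so a problem solved on three of four rollouts scores 0.75.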