CLaaS: Continual learning as a service for sample efficient online learning

Barak Widawsky; Kion Fallah; Qingqing Mao; Silen Naihin

arxiv: 2606.05559 · v1 · pith:NBFXUUD4new · submitted 2026-06-04 · 💻 cs.LG

CLaaS: Continual learning as a service for sample efficient online learning

Kion Fallah , Silen Naihin , Barak Widawsky , Qingqing Mao This is my paper

Pith reviewed 2026-06-28 02:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual learningonline learningexperience replaylanguage model agentsforward transfercatastrophic forgettingsample efficiency

0 comments

The pith

CLaaS lets language-model agents adapt online from single-use rollouts by storing them in a replay buffer for gradient reuse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how deployed agents must handle distribution shifts in environments that cannot be reset, so each scenario yields only one rollout. It introduces CLaaS, a service that keeps an experience replay buffer behind a chat API and performs asynchronous parametric updates on the stored data. On an adversarial streaming task the authors find that these updates produce better forward transfer to new scenarios and less forgetting of earlier ones than in-context learning alone, and that the replay step is required to reach the reported sample efficiency.

Core claim

In an experiential online continual learning setting where each scenario is encountered only once, storing rollouts in an experience replay buffer enables asynchronous gradient-based training that yields superior forward transfer and reduced forgetting relative to in-context learning.

What carries the argument

An experience replay buffer that stores single-use agent rollouts so gradients can be reused during asynchronous training.

If this is right

Parametric updates produce higher forward transfer to future tasks than in-context learning.
Parametric updates produce lower forgetting of prior tasks than in-context learning.
The replay buffer is required to reach the observed sample efficiency.
The mechanism can be hidden behind a standard chat API without changing the agent interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same replay-plus-asynchronous-update pattern could be applied in other non-resettable domains such as physical robotics.
Long-running deployments might accumulate enough replay data to support periodic model compression or distillation steps.
The separation of rollout collection from training opens the possibility of scaling the service to many agents sharing one buffer.

Load-bearing premise

The replay buffer can be filled and sampled asynchronously from single-use rollouts without introducing bias or instability that cancels the reported gains in transfer and retention.

What would settle it

A controlled run on the same adversarial task in which replay is disabled yet forward transfer and forgetting metrics remain at least as good as the replay-enabled condition.

Figures

Figures reproduced from arXiv: 2606.05559 by Barak Widawsky, Kion Fallah, Qingqing Mao, Silen Naihin.

**Figure 1.** Figure 1: CLaaS enables experiential learning of agents from a stream of rollouts gathered during deployment. Gradient reuse from a replay buffer improves sample efficiency during async training, leading to faster generalization with less data. improvements that generalize to future tasks while avoiding degradation on previous tasks. Parametric updates via onpolicy reinforcement learning (RL) have been effective fo… view at source ↗

**Figure 2.** Figure 2: CLaaS: continual learning as-a-service for online policy improvements via async training. Given any user agent harness that uses a chat API, CLaaS collects live rollouts in an experience replay buffer B. The training engine improves the policy by sampling batches, in addition to rewards gathered from the environment, for gradient updates to a LoRA that gets hot-reloaded into the inference server. The contr… view at source ↗

**Figure 3.** Figure 3: Average defender success rate across splits at every checkpoint under different replay eviction ages with REINFORCE++. Performance improves until Amax = 25 before destabilizing at larger replay ages. 4. Experiments 4.1. Experimental Setup Experiments are conducted using CLaaS with a sequence of scenarios that induce adversarial distribution shift. The policy is assigned reward for responses which comply… view at source ↗

**Figure 4.** Figure 4: Walkthrough of one adaptive IH-Challenge scenario. The attacker receives ICL feedback after each turn and adapts its strategy. The defender’s responses are graded by a per-scenario Python verifier. Successful defenses produce r = 1 training signals; failures produce r = 0 with corrective feedback for SDPO reprompting. • Score (defender pass rate): SDPO converges to higher scores faster and maintains them; … view at source ↗

**Figure 5.** Figure 5: Checkpoint-split transfer matrices for each method, averaged over 9 trials. Each cell shows defender success rate (%) when checkpoint k (row) is evaluated on split j (column). SDPO achieves uniformly high performance across the matrix, indicating both strong forward transfer and minimal forgetting [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Training dynamics comparison across methods. Left: defender score over training steps. Center: policy entropy. Right: policy gradient objective. Thin lines are raw values; bold lines are smoothed (window=10). SDPO maintains higher entropy and lower PG loss variance throughout training. C.2. Method-Specific Configurations Each method uses the shared configuration above and overrides algorithm-specific hyper… view at source ↗

**Figure 7.** Figure 7: Left: timing breakdown per training step for each method. Right: replay buffer size over training steps, showing the steady-state fill level with async collection and age-based eviction. Periodic drops in SDPO buffer size correspond to split boundaries where stale records are evicted [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: SDPO teacher reprompt templates used on IH-Challenge. <SOLUTION> is replaced by the student’s prior visible response (think blocks stripped). The template produces a conditional prompt for the EMA teacher to generate an improved response [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

read the original abstract

Deployed large language model agents must adapt to distribution shift in dynamic environments. Ideally, adaptation can be performed from accumulated agent experiences and retain prior capabilities while transferring to future tasks. However, agent actions and environmental transitions can only be sampled once per scenario, as real-world environments cannot be trivially reset. To this end, we investigate an experiential and online continual learning setting in which agents learn from a stream of scenarios. We propose continual learning as-a-service (CLaaS), a system which enables agents to improve during deployment, abstracted behind a chat API. To increase sample efficiency, CLaaS stores rollouts in an experience replay buffer for gradient reuse during asynchronous training. We evaluate CLaaS on an adversarial task, demonstrating that parametric updates lead to superior forward transfer and less forgetting than in-context learning, with replay being a critical choice for sample efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CLaaS offers a service framing for online agent adaptation but the abstract supplies no metrics or bias controls, leaving the main claim uncheckable.

read the letter

The thing to know is that this paper proposes CLaaS as a chat-API service layer so LLM agents can do continual learning from single-use experience streams in non-resettable environments. It claims parametric updates with replay give better forward transfer and less forgetting than in-context learning, and that replay is essential for sample efficiency.

The new piece is the service abstraction itself plus the specific one-shot online setting where rollouts cannot be repeated. The paper correctly identifies that real deployments cannot reset scenarios at will, so any learning must work from a one-pass stream. Framing the learner as a backend service could make it easier to drop into existing agent code without changing the front-end interface.

The evaluation section is the clear weak point. The abstract asserts superiority on an adversarial task but reports no numbers, no baselines, no error bars, and no dataset or exclusion details. That alone makes the result impossible to assess. The stress-test concern about off-policy bias also lands: the buffer is filled from a changing policy on single-use rollouts and sampled asynchronously for gradients. Without importance sampling, buffer management, or any correction described, the stored transitions come from stale distributions, which can erase efficiency gains and hurt retention. The paper would need to show how it avoids that bias to support the sample-efficiency claim.

This is aimed at practitioners who build deployed agents and want a reusable learning backend. A reader already working on replay-based continual learning for LLMs might pick up the API framing if the full experiments include proper controls.

I would not send it to peer review in this form. The central empirical result lacks the data and methods needed for a referee to evaluate it.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CLaaS, a continual learning as-a-service system for LLM agents adapting to distribution shift in non-resettable environments. Agents accumulate experiences from a stream of scenarios; rollouts are stored in an experience replay buffer to enable asynchronous gradient-based parametric updates. The central empirical claim is that these parametric updates yield superior forward transfer and less forgetting than in-context learning, with replay being critical for sample efficiency, as demonstrated on an adversarial task.

Significance. If the results hold with appropriate controls, the work addresses a practically important setting for deployed agents where environments cannot be reset and experiences are single-use. The emphasis on replay for gradient reuse in an online continual-learning regime is a concrete contribution to sample-efficient adaptation while retaining prior capabilities.

major comments (2)

[Abstract] Abstract: the claim that parametric updates lead to superior forward transfer and less forgetting is presented without any quantitative metrics, baselines, error bars, dataset details, or exclusion criteria, so it is not possible to assess whether the data support the superiority statement.
[CLaaS system description] CLaaS system description: the replay buffer is populated from single-use rollouts under a changing policy and sampled asynchronously for gradient reuse, yet no off-policy correction (importance sampling, prioritized replay, or explicit bias analysis) is described; this directly threatens the sample-efficiency and stability claims that underpin the reported gains in transfer and retention.

minor comments (1)

[Abstract] The phrase 'adversarial task' is used without reference to a concrete benchmark, environment, or task definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that parametric updates lead to superior forward transfer and less forgetting is presented without any quantitative metrics, baselines, error bars, dataset details, or exclusion criteria, so it is not possible to assess whether the data support the superiority statement.

Authors: We agree that the abstract lacks quantitative support for the claims. In the revised manuscript we will update the abstract to include specific metrics on forward transfer and retention, the in-context learning baseline, error bars, and details on the adversarial task and evaluation criteria. revision: yes
Referee: [CLaaS system description] CLaaS system description: the replay buffer is populated from single-use rollouts under a changing policy and sampled asynchronously for gradient reuse, yet no off-policy correction (importance sampling, prioritized replay, or explicit bias analysis) is described; this directly threatens the sample-efficiency and stability claims that underpin the reported gains in transfer and retention.

Authors: We acknowledge that the manuscript does not describe off-policy corrections. We will revise the CLaaS system description to include an explicit discussion of policy-induced bias together with either an importance-sampling correction or a quantitative bias analysis, so that the sample-efficiency and stability claims rest on firmer methodological ground. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system comparison with no derivation chain

full rationale

The paper describes an empirical evaluation of CLaaS, a continual learning system using experience replay for online adaptation of LLM agents. The central claims concern experimental outcomes on forward transfer and forgetting when comparing parametric updates (with replay) against in-context learning. No mathematical derivations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes are present in the provided text. The replay buffer usage is presented as an implementation choice whose benefits are measured experimentally rather than derived by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that real-world scenarios permit only single samples per rollout and that standard replay techniques transfer directly to LLM agents without new instabilities.

axioms (1)

domain assumption Real-world environments cannot be trivially reset, so agent actions and transitions can be sampled only once per scenario.
Explicitly stated in the abstract as the motivation for the experiential online setting.

pith-pipeline@v0.9.1-grok · 5687 in / 1123 out tokens · 47393 ms · 2026-06-28T02:46:29.727078+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 4 canonical work pages

[1]

M., Bohnet, B., Rosias, L., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J

Agarwal, R., Singh, A., Zhang, L. M., Bohnet, B., Rosias, L., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J. D., Chu, E., Behbahani, F., Faust, A., and Larochelle, H. Many-shot in-context learning, 2024. URL https://arxiv.org/abs/2404.11018

arXiv 2024
[2]

Biderman, D., Portes, J., Ortiz, J. J. G., Paul, M., Greengard, P., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., and Cunningham, J. P. Lora learns less and forgets less, 2024. URL https://arxiv.org/abs/2405.09673

arXiv 2024
[3]

Challenges of real-world reinforcement learning

Dulac-Arnold, G., Mankowitz, D., and Hester, T. Challenges of real-world reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, 2019

2019
[4]

F., Zhu, S., Choquette-Choo, C

Guo, C., Ceron Uribe, J. F., Zhu, S., Choquette-Choo, C. A., Lin, S., Kandpal, N., Nasr, M., Rai, Toyer, S., Wang, M., Yu, Y., Beutel, A., and Xiao, K. Ih-challenge: A training dataset to improve instruction hierarchy on frontier llms. March 2026. URL https://cdn.openai.com/pdf/14e541fa-7e48-4d79-9cbf-61c3cde3e263/ih-challenge-paper.pdf

2026
[5]

Hybridflow: A flexible and efficient rlhf framework

Guo, H., Jiang, Z., , et al. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the 20th European Conference on Computer Systems (EuroSys), 2025. URL https://github.com/volcengine/verl

2025
[6]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

Pith/arXiv arXiv 2021
[7]

K., Xu, H., and Shen, W

Hu, J., Liu, J. K., Xu, H., and Shen, W. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. URL https://arxiv.org/abs/2501.03262

Pith/arXiv arXiv 2025
[8]

Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024

Huang, J., Cui, L., Wang, A., Yang, C., Liao, X., Song, L., Yao, J., and Su, J. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024. URL https://arxiv.org/abs/2403.01244

arXiv 2024
[9]

K., Guestrin, C., and Krause, A

Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Buening, T. K., Guestrin, C., and Krause, A. Reinforcement learning via self-distillation, 2026. URL https://arxiv.org/abs/2601.20802

Pith/arXiv arXiv 2026
[10]

Proceedings of the National Academy of Sciences114(13), 3521– 3526 (2017).https://doi.org/10.1073/pnas.1611835114,https://www.pnas

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): 0 3521–3526, March 2017. ISSN 1091-6490. ...

work page doi:10.1073/pnas.1611835114 2017
[11]

Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42 0 (4): 0 1143--1166, 2003. doi:10.1137/S0363012901385691

work page doi:10.1137/s0363012901385691 2003
[12]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023
[13]

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Bras, R. L., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirzi, H. Tulu 3: Pushing frontiers in open language model post-training, 2025. UR...

Pith/arXiv arXiv 2025
[14]

Machine Learning , author =

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8 0 (3--4): 0 293--321, 1992. doi:10.1007/BF00992699

work page doi:10.1007/bf00992699 1992
[15]

An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025

Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., and Zhang, Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747

Pith/arXiv arXiv 2025
[16]

Cohen , abstract =

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pp.\ 109--165. Academic Press, 1989. doi:https://doi.org/10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/pii/S0079742108605368

work page doi:10.1016/s0079-7421(08)60536-8 1989
[17]

Reinforcement learning finetunes small subnetworks in large language models, 2025

Mukherjee, S., Yuan, L., Hakkani-Tur, D., and Peng, H. Reinforcement learning finetunes small subnetworks in large language models, 2025. URL https://arxiv.org/abs/2505.11711

arXiv 2025
[18]

Pipelinerl: Faster on-policy reinforcement learning for long sequence generation, 2025

Piché, A., Kamalloo, E., Pardinas, R., Chen, X., and Bahdanau, D. Pipelinerl: Faster on-policy reinforcement learning for long sequence generation, 2025. URL https://arxiv.org/abs/2509.19128

arXiv 2025
[19]

Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. icarl: Incremental classifier and representation learning, 2017. URL https://arxiv.org/abs/1611.07725

Pith/arXiv arXiv 2017
[20]

Proximal policy optimization algorithms, 2017

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

Pith/arXiv arXiv 2017
[21]

K., Wu, Y., and Guo, D

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

Pith/arXiv arXiv 2024
[22]

Rl's razor: Why online reinforcement learning forgets less, 2025

Shenfeld, I., Pari, J., and Agrawal, P. Rl's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259

Pith/arXiv arXiv 2025
[23]

Test-time training with self-supervision for generalization under distribution shifts

Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., and Hardt, M. Test-time training with self-supervision for generalization under distribution shifts. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.\ 9229--9248. PMLR, 13--18 Jul 2020. URL...

2020
[24]

Openclaw-rl: Train any agent simply by talking, 2026

Wang, Y., Chen, X., Jin, X., Wang, M., and Yang, L. Openclaw-rl: Train any agent simply by talking, 2026. URL https://arxiv.org/abs/2603.10165

Pith/arXiv arXiv 2026
[25]

Qwen3 technical report, 2025

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

Pith/arXiv arXiv 2025
[26]

Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents, 2026

Zhang, H., Liu, M., Zhang, S., Han, S., Hu, J., Jin, Z., Zhang, Y., Diao, S., Lu, X., Xu, B., Yu, Z., Kautz, J., and Dong, Y. Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents, 2026. URL https://arxiv.org/abs/2603.18815

arXiv 2026

[1] [1]

M., Bohnet, B., Rosias, L., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J

Agarwal, R., Singh, A., Zhang, L. M., Bohnet, B., Rosias, L., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J. D., Chu, E., Behbahani, F., Faust, A., and Larochelle, H. Many-shot in-context learning, 2024. URL https://arxiv.org/abs/2404.11018

arXiv 2024

[2] [2]

Biderman, D., Portes, J., Ortiz, J. J. G., Paul, M., Greengard, P., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., and Cunningham, J. P. Lora learns less and forgets less, 2024. URL https://arxiv.org/abs/2405.09673

arXiv 2024

[3] [3]

Challenges of real-world reinforcement learning

Dulac-Arnold, G., Mankowitz, D., and Hester, T. Challenges of real-world reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, 2019

2019

[4] [4]

F., Zhu, S., Choquette-Choo, C

Guo, C., Ceron Uribe, J. F., Zhu, S., Choquette-Choo, C. A., Lin, S., Kandpal, N., Nasr, M., Rai, Toyer, S., Wang, M., Yu, Y., Beutel, A., and Xiao, K. Ih-challenge: A training dataset to improve instruction hierarchy on frontier llms. March 2026. URL https://cdn.openai.com/pdf/14e541fa-7e48-4d79-9cbf-61c3cde3e263/ih-challenge-paper.pdf

2026

[5] [5]

Hybridflow: A flexible and efficient rlhf framework

Guo, H., Jiang, Z., , et al. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the 20th European Conference on Computer Systems (EuroSys), 2025. URL https://github.com/volcengine/verl

2025

[6] [6]

J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

Pith/arXiv arXiv 2021

[7] [7]

K., Xu, H., and Shen, W

Hu, J., Liu, J. K., Xu, H., and Shen, W. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. URL https://arxiv.org/abs/2501.03262

Pith/arXiv arXiv 2025

[8] [8]

Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024

Huang, J., Cui, L., Wang, A., Yang, C., Liao, X., Song, L., Yao, J., and Su, J. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024. URL https://arxiv.org/abs/2403.01244

arXiv 2024

[9] [9]

K., Guestrin, C., and Krause, A

Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Buening, T. K., Guestrin, C., and Krause, A. Reinforcement learning via self-distillation, 2026. URL https://arxiv.org/abs/2601.20802

Pith/arXiv arXiv 2026

[10] [10]

Proceedings of the National Academy of Sciences114(13), 3521– 3526 (2017).https://doi.org/10.1073/pnas.1611835114,https://www.pnas

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): 0 3521–3526, March 2017. ISSN 1091-6490. ...

work page doi:10.1073/pnas.1611835114 2017

[11] [11]

Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42 0 (4): 0 1143--1166, 2003. doi:10.1137/S0363012901385691

work page doi:10.1137/s0363012901385691 2003

[12] [12]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023

[13] [13]

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Bras, R. L., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirzi, H. Tulu 3: Pushing frontiers in open language model post-training, 2025. UR...

Pith/arXiv arXiv 2025

[14] [14]

Machine Learning , author =

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8 0 (3--4): 0 293--321, 1992. doi:10.1007/BF00992699

work page doi:10.1007/bf00992699 1992

[15] [15]

An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025

Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., and Zhang, Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747

Pith/arXiv arXiv 2025

[16] [16]

Cohen , abstract =

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pp.\ 109--165. Academic Press, 1989. doi:https://doi.org/10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/pii/S0079742108605368

work page doi:10.1016/s0079-7421(08)60536-8 1989

[17] [17]

Reinforcement learning finetunes small subnetworks in large language models, 2025

Mukherjee, S., Yuan, L., Hakkani-Tur, D., and Peng, H. Reinforcement learning finetunes small subnetworks in large language models, 2025. URL https://arxiv.org/abs/2505.11711

arXiv 2025

[18] [18]

Pipelinerl: Faster on-policy reinforcement learning for long sequence generation, 2025

Piché, A., Kamalloo, E., Pardinas, R., Chen, X., and Bahdanau, D. Pipelinerl: Faster on-policy reinforcement learning for long sequence generation, 2025. URL https://arxiv.org/abs/2509.19128

arXiv 2025

[19] [19]

Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. icarl: Incremental classifier and representation learning, 2017. URL https://arxiv.org/abs/1611.07725

Pith/arXiv arXiv 2017

[20] [20]

Proximal policy optimization algorithms, 2017

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

Pith/arXiv arXiv 2017

[21] [21]

K., Wu, Y., and Guo, D

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

Pith/arXiv arXiv 2024

[22] [22]

Rl's razor: Why online reinforcement learning forgets less, 2025

Shenfeld, I., Pari, J., and Agrawal, P. Rl's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259

Pith/arXiv arXiv 2025

[23] [23]

Test-time training with self-supervision for generalization under distribution shifts

Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., and Hardt, M. Test-time training with self-supervision for generalization under distribution shifts. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.\ 9229--9248. PMLR, 13--18 Jul 2020. URL...

2020

[24] [24]

Openclaw-rl: Train any agent simply by talking, 2026

Wang, Y., Chen, X., Jin, X., Wang, M., and Yang, L. Openclaw-rl: Train any agent simply by talking, 2026. URL https://arxiv.org/abs/2603.10165

Pith/arXiv arXiv 2026

[25] [25]

Qwen3 technical report, 2025

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

Pith/arXiv arXiv 2025

[26] [26]

Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents, 2026

Zhang, H., Liu, M., Zhang, S., Han, S., Hu, J., Jin, Z., Zhang, Y., Diao, S., Lu, X., Xu, B., Yu, Z., Kautz, J., and Dong, Y. Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents, 2026. URL https://arxiv.org/abs/2603.18815

arXiv 2026