CLaaS: Continual learning as a service for sample efficient online learning
Pith reviewed 2026-06-28 02:46 UTC · model grok-4.3
The pith
CLaaS lets language-model agents adapt online from single-use rollouts by storing them in a replay buffer for gradient reuse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In an experiential online continual learning setting where each scenario is encountered only once, storing rollouts in an experience replay buffer enables asynchronous gradient-based training that yields superior forward transfer and reduced forgetting relative to in-context learning.
What carries the argument
An experience replay buffer that stores single-use agent rollouts so gradients can be reused during asynchronous training.
If this is right
- Parametric updates produce higher forward transfer to future tasks than in-context learning.
- Parametric updates produce lower forgetting of prior tasks than in-context learning.
- The replay buffer is required to reach the observed sample efficiency.
- The mechanism can be hidden behind a standard chat API without changing the agent interface.
Where Pith is reading between the lines
- The same replay-plus-asynchronous-update pattern could be applied in other non-resettable domains such as physical robotics.
- Long-running deployments might accumulate enough replay data to support periodic model compression or distillation steps.
- The separation of rollout collection from training opens the possibility of scaling the service to many agents sharing one buffer.
Load-bearing premise
The replay buffer can be filled and sampled asynchronously from single-use rollouts without introducing bias or instability that cancels the reported gains in transfer and retention.
What would settle it
A controlled run on the same adversarial task in which replay is disabled yet forward transfer and forgetting metrics remain at least as good as the replay-enabled condition.
Figures
read the original abstract
Deployed large language model agents must adapt to distribution shift in dynamic environments. Ideally, adaptation can be performed from accumulated agent experiences and retain prior capabilities while transferring to future tasks. However, agent actions and environmental transitions can only be sampled once per scenario, as real-world environments cannot be trivially reset. To this end, we investigate an experiential and online continual learning setting in which agents learn from a stream of scenarios. We propose continual learning as-a-service (CLaaS), a system which enables agents to improve during deployment, abstracted behind a chat API. To increase sample efficiency, CLaaS stores rollouts in an experience replay buffer for gradient reuse during asynchronous training. We evaluate CLaaS on an adversarial task, demonstrating that parametric updates lead to superior forward transfer and less forgetting than in-context learning, with replay being a critical choice for sample efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CLaaS, a continual learning as-a-service system for LLM agents adapting to distribution shift in non-resettable environments. Agents accumulate experiences from a stream of scenarios; rollouts are stored in an experience replay buffer to enable asynchronous gradient-based parametric updates. The central empirical claim is that these parametric updates yield superior forward transfer and less forgetting than in-context learning, with replay being critical for sample efficiency, as demonstrated on an adversarial task.
Significance. If the results hold with appropriate controls, the work addresses a practically important setting for deployed agents where environments cannot be reset and experiences are single-use. The emphasis on replay for gradient reuse in an online continual-learning regime is a concrete contribution to sample-efficient adaptation while retaining prior capabilities.
major comments (2)
- [Abstract] Abstract: the claim that parametric updates lead to superior forward transfer and less forgetting is presented without any quantitative metrics, baselines, error bars, dataset details, or exclusion criteria, so it is not possible to assess whether the data support the superiority statement.
- [CLaaS system description] CLaaS system description: the replay buffer is populated from single-use rollouts under a changing policy and sampled asynchronously for gradient reuse, yet no off-policy correction (importance sampling, prioritized replay, or explicit bias analysis) is described; this directly threatens the sample-efficiency and stability claims that underpin the reported gains in transfer and retention.
minor comments (1)
- [Abstract] The phrase 'adversarial task' is used without reference to a concrete benchmark, environment, or task definition.
Simulated Author's Rebuttal
We thank the referee for their insightful comments. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that parametric updates lead to superior forward transfer and less forgetting is presented without any quantitative metrics, baselines, error bars, dataset details, or exclusion criteria, so it is not possible to assess whether the data support the superiority statement.
Authors: We agree that the abstract lacks quantitative support for the claims. In the revised manuscript we will update the abstract to include specific metrics on forward transfer and retention, the in-context learning baseline, error bars, and details on the adversarial task and evaluation criteria. revision: yes
-
Referee: [CLaaS system description] CLaaS system description: the replay buffer is populated from single-use rollouts under a changing policy and sampled asynchronously for gradient reuse, yet no off-policy correction (importance sampling, prioritized replay, or explicit bias analysis) is described; this directly threatens the sample-efficiency and stability claims that underpin the reported gains in transfer and retention.
Authors: We acknowledge that the manuscript does not describe off-policy corrections. We will revise the CLaaS system description to include an explicit discussion of policy-induced bias together with either an importance-sampling correction or a quantitative bias analysis, so that the sample-efficiency and stability claims rest on firmer methodological ground. revision: yes
Circularity Check
No circularity: empirical system comparison with no derivation chain
full rationale
The paper describes an empirical evaluation of CLaaS, a continual learning system using experience replay for online adaptation of LLM agents. The central claims concern experimental outcomes on forward transfer and forgetting when comparing parametric updates (with replay) against in-context learning. No mathematical derivations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes are present in the provided text. The replay buffer usage is presented as an implementation choice whose benefits are measured experimentally rather than derived by construction from the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Real-world environments cannot be trivially reset, so agent actions and transitions can be sampled only once per scenario.
Reference graph
Works this paper leans on
-
[1]
M., Bohnet, B., Rosias, L., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J
Agarwal, R., Singh, A., Zhang, L. M., Bohnet, B., Rosias, L., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J. D., Chu, E., Behbahani, F., Faust, A., and Larochelle, H. Many-shot in-context learning, 2024. URL https://arxiv.org/abs/2404.11018
arXiv 2024
-
[2]
Biderman, D., Portes, J., Ortiz, J. J. G., Paul, M., Greengard, P., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., and Cunningham, J. P. Lora learns less and forgets less, 2024. URL https://arxiv.org/abs/2405.09673
arXiv 2024
-
[3]
Challenges of real-world reinforcement learning
Dulac-Arnold, G., Mankowitz, D., and Hester, T. Challenges of real-world reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, 2019
2019
-
[4]
F., Zhu, S., Choquette-Choo, C
Guo, C., Ceron Uribe, J. F., Zhu, S., Choquette-Choo, C. A., Lin, S., Kandpal, N., Nasr, M., Rai, Toyer, S., Wang, M., Yu, Y., Beutel, A., and Xiao, K. Ih-challenge: A training dataset to improve instruction hierarchy on frontier llms. March 2026. URL https://cdn.openai.com/pdf/14e541fa-7e48-4d79-9cbf-61c3cde3e263/ih-challenge-paper.pdf
2026
-
[5]
Hybridflow: A flexible and efficient rlhf framework
Guo, H., Jiang, Z., , et al. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the 20th European Conference on Computer Systems (EuroSys), 2025. URL https://github.com/volcengine/verl
2025
-
[6]
J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
Pith/arXiv arXiv 2021
-
[7]
Hu, J., Liu, J. K., Xu, H., and Shen, W. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. URL https://arxiv.org/abs/2501.03262
Pith/arXiv arXiv 2025
-
[8]
Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024
Huang, J., Cui, L., Wang, A., Yang, C., Liao, X., Song, L., Yao, J., and Su, J. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024. URL https://arxiv.org/abs/2403.01244
arXiv 2024
-
[9]
K., Guestrin, C., and Krause, A
Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Buening, T. K., Guestrin, C., and Krause, A. Reinforcement learning via self-distillation, 2026. URL https://arxiv.org/abs/2601.20802
Pith/arXiv arXiv 2026
-
[10]
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): 0 3521–3526, March 2017. ISSN 1091-6490. ...
-
[11]
Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42 0 (4): 0 1143--1166, 2003. doi:10.1137/S0363012901385691
-
[12]
H., Gonzalez, J
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
2023
-
[13]
Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Bras, R. L., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirzi, H. Tulu 3: Pushing frontiers in open language model post-training, 2025. UR...
Pith/arXiv arXiv 2025
-
[14]
Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8 0 (3--4): 0 293--321, 1992. doi:10.1007/BF00992699
-
[15]
Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., and Zhang, Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747
Pith/arXiv arXiv 2025
-
[16]
McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pp.\ 109--165. Academic Press, 1989. doi:https://doi.org/10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/pii/S0079742108605368
-
[17]
Reinforcement learning finetunes small subnetworks in large language models, 2025
Mukherjee, S., Yuan, L., Hakkani-Tur, D., and Peng, H. Reinforcement learning finetunes small subnetworks in large language models, 2025. URL https://arxiv.org/abs/2505.11711
arXiv 2025
-
[18]
Pipelinerl: Faster on-policy reinforcement learning for long sequence generation, 2025
Piché, A., Kamalloo, E., Pardinas, R., Chen, X., and Bahdanau, D. Pipelinerl: Faster on-policy reinforcement learning for long sequence generation, 2025. URL https://arxiv.org/abs/2509.19128
arXiv 2025
-
[19]
Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. icarl: Incremental classifier and representation learning, 2017. URL https://arxiv.org/abs/1611.07725
Pith/arXiv arXiv 2017
-
[20]
Proximal policy optimization algorithms, 2017
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347
Pith/arXiv arXiv 2017
-
[21]
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
Pith/arXiv arXiv 2024
-
[22]
Rl's razor: Why online reinforcement learning forgets less, 2025
Shenfeld, I., Pari, J., and Agrawal, P. Rl's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259
Pith/arXiv arXiv 2025
-
[23]
Test-time training with self-supervision for generalization under distribution shifts
Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., and Hardt, M. Test-time training with self-supervision for generalization under distribution shifts. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.\ 9229--9248. PMLR, 13--18 Jul 2020. URL...
2020
-
[24]
Openclaw-rl: Train any agent simply by talking, 2026
Wang, Y., Chen, X., Jin, X., Wang, M., and Yang, L. Openclaw-rl: Train any agent simply by talking, 2026. URL https://arxiv.org/abs/2603.10165
Pith/arXiv arXiv 2026
-
[25]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...
Pith/arXiv arXiv 2025
-
[26]
Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents, 2026
Zhang, H., Liu, M., Zhang, S., Han, S., Hu, J., Jin, Z., Zhang, Y., Diao, S., Lu, X., Xu, B., Yu, Z., Kautz, J., and Dong, Y. Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents, 2026. URL https://arxiv.org/abs/2603.18815
arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.