
arxiv: 2606.25527 · v1 · pith:6QE4WGXLnew · submitted 2026-06-24 · 💻 cs.LG

Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors

Pith reviewed 2026-06-25 21:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords online reinforcement learning · offline priors · diagnosis-driven · tension management · offline-to-online RL · foundation models · embodied intelligence · adaptive deployment

The pith

Online RL with offline priors should use deployment-specific evidence to manage tensions rather than seeking universal strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the validity of offline priors in online reinforcement learning varies across deployments and changes during training. This variation means that no single method for managing the priors is optimal in all cases, and that rankings from benchmarks have limited applicability to real deployments. The authors propose shifting the field toward diagnosis-driven tension management, where specific evidence from the deployment guides how the learner interacts with its priors over the course of training. This approach would enable more flexible and adaptive use of offline knowledge in diverse settings such as foundation model post-training and embodied intelligence.

Core claim

Because prior validity varies across deployments and shifts during training, no single approach to managing it is universally optimal and benchmark rankings offer limited guidance for real-world use; the field should shift to diagnosis-driven tension management in which deployment-specific evidence guides how the learner relates to its priors throughout training.
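To make the idea concrete, here is a minimal sketch of what one diagnosis-driven adjustment could look like. The summary above does not specify any mechanism, so everything here is an assumption made for illustration: the `PriorDiagnosis` record, the `update_prior_weight` function, the fixed `step` size, and the rule of comparing returns from rollouts that follow the prior against rollouts that deviate from it.

```python
from dataclasses import dataclass


@dataclass
class PriorDiagnosis:
    """Hypothetical deployment-specific evidence gathered during training."""
    return_following_prior: float  # mean return of rollouts that follow the prior
    return_deviating: float        # mean return of rollouts that deviate from it


def update_prior_weight(weight: float, diag: PriorDiagnosis, step: float = 0.1) -> float:
    """Raise the prior weight when the prior helps, lower it when it hurts.

    Assumed rule for illustration only; the result is clipped to [0, 1].
    """
    if diag.return_following_prior > diag.return_deviating:
        weight += step
    elif diag.return_following_prior < diag.return_deviating:
        weight -= step
    return min(1.0, max(0.0, weight))


# Made-up evidence, not results from the paper.
w = 0.5
w = update_prior_weight(w, PriorDiagnosis(1.0, 2.0))  # prior hurts: weight decreases
w = update_prior_weight(w, PriorDiagnosis(3.0, 2.0))  # prior helps: weight increases
```

The point of the sketch is only that the weight is driven by evidence collected in the deployment itself and can move in either direction as training proceeds, rather than following a schedule fixed in advance.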

What carries the argument

A framework characterizing how priors reshape online optimization through three functional roles.

If this is right

  • Help-or-hurt reversals occur in controlled experiments depending on the deployment.
  • Cross-domain evidence supports the approach from foundation model post-training to embodied intelligence.
  • Five substantive counterarguments can be addressed from within this view.
  • Both flexible and adaptive deployment become feasible.
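A help-or-hurt reversal, as used in the list above, can be read as the same prior improving performance in one deployment and degrading it in another. The following sketch shows one way to check for that pattern. The function names, the result layout (a pair of returns with and without the prior), and the numbers are all hypothetical and are not taken from the paper's experiments.

```python
def prior_effect(return_with_prior: float, return_without_prior: float) -> str:
    """Label a deployment by whether the prior helped, hurt, or made no difference."""
    delta = return_with_prior - return_without_prior
    if delta > 0:
        return "help"
    if delta < 0:
        return "hurt"
    return "neutral"


def has_reversal(results: dict[str, tuple[float, float]]) -> bool:
    """True if the same prior helps in one deployment and hurts in another."""
    effects = {prior_effect(w, wo) for w, wo in results.values()}
    return "help" in effects and "hurt" in effects


# Made-up numbers for illustration, not results from the paper.
toy_results = {
    "deployment_a": (120.0, 90.0),  # (with prior, without prior)
    "deployment_b": (60.0, 85.0),
}
print(has_reversal(toy_results))
```

A real comparison would also need multiple seeds and some measure of uncertainty before calling a difference a reversal; the sketch compares raw means only.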

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Monitoring tools for detecting shifts in prior validity could become standard in RL deployments.
  • Similar diagnosis-driven strategies might benefit other areas of machine learning that rely on pre-trained components.
  • New evaluation protocols could focus on adaptability metrics instead of fixed benchmark scores.

Load-bearing premise

Prior validity varies across deployments and shifts during training such that no single management approach is universally optimal and benchmark rankings offer limited guidance for real-world use.

What would settle it

A demonstration that one fixed prior-management strategy consistently ranks highest across a diverse range of real-world deployments and training stages would falsify the central claim.
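The falsification test described here can be phrased as a simple check over a table of scores: does any single strategy rank first in every setting? The sketch below assumes a hypothetical layout, `scores[setting][strategy]`, where a setting is one (deployment, training stage) pair and higher scores are better. The function name and the toy data are illustrative assumptions.

```python
from typing import Optional


def universally_best(scores: dict[str, dict[str, float]]) -> Optional[str]:
    """Return the strategy that ranks first in every setting, or None if there is none.

    Ties for first place in any setting count as "no single winner" there.
    """
    winners = set()
    for per_strategy in scores.values():
        best = max(per_strategy.values())
        top = [name for name, s in per_strategy.items() if s == best]
        if len(top) != 1:
            return None
        winners.add(top[0])
    return winners.pop() if len(winners) == 1 else None


# Made-up scores for illustration, not results from the paper.
toy_scores = {
    "robot_arm/early": {"keep_prior": 0.8, "drop_prior": 0.5},
    "robot_arm/late": {"keep_prior": 0.6, "drop_prior": 0.9},
}
print(universally_best(toy_scores))
```

With the toy data the winner changes between settings, so the function returns `None`. A non-`None` result over a broad, realistic set of settings is the kind of evidence that would count against the central claim.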

Original abstract

Online reinforcement learning (RL) agents increasingly depend on knowledge acquired offline to achieve practical efficiency. Originally studied in offline-to-online RL, this paradigm now spans foundation model post-training and embodied intelligence, with prior types expanding from offline datasets and pre-trained policies to increasingly diverse knowledge sources such as multimodal foundation models and generative world models. Offline priors have become central to how deep RL is developed and deployed. However, this reliance introduces a challenge that the prevailing benchmark-driven paradigm cannot resolve: because prior validity varies across deployments and shifts during training, no single approach to managing it is universally optimal, and benchmark rankings offer limited guidance for real-world deployments. Rather than pursuing universal solutions, we argue that the field should shift to diagnosis-driven tension management, in which deployment-specific evidence guides how the learner relates to its priors throughout training, enabling both flexible and adaptive deployment. We support this position with a framework characterizing how priors reshape online optimization through three functional roles, controlled experiments demonstrating help-or-hurt reversals, cross-domain evidence from foundation model post-training to embodied intelligence, and engagement with five substantive counterarguments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is a position paper arguing that offline priors in online RL have validity that varies across deployments and shifts during training, so that no single management strategy is universally optimal and benchmark rankings provide limited guidance for real-world use. It advocates shifting the field to diagnosis-driven tension management, in which deployment-specific evidence guides how the learner relates to its priors. The position is supported by a framework that characterizes priors through three functional roles, controlled experiments demonstrating help-or-hurt reversals, cross-domain evidence spanning foundation-model post-training to embodied intelligence, and explicit engagement with five counterarguments.

Significance. If the variation premise is granted, the argument could usefully reorient research away from the search for universal offline-to-online recipes toward adaptive, evidence-based practices that are more relevant to practical deployments in foundation-model post-training and embodied systems. The paper earns credit for supplying an explicit three-role framework, for including controlled experiments that exhibit reversals, for assembling cross-domain evidence, and for directly addressing counterarguments rather than leaving them implicit.

minor comments (3)
  1. The three functional roles for priors are introduced in the abstract and presumably elaborated in the framework section; a short table or diagram summarizing the roles, their interactions with online optimization, and the associated tension-management levers would improve readability.
  2. The manuscript states that five substantive counterarguments are engaged; listing them explicitly (e.g., in a dedicated subsection or enumerated list) would make the rebuttal structure easier to follow.
  3. Cross-domain evidence is cited as support; adding a concise summary table that maps each domain to the observed reversal or tension pattern would strengthen the presentation without altering the argument.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review, including the recognition of the three-role framework, controlled reversal experiments, cross-domain evidence, and direct engagement with counterarguments. The recommendation for minor revision is appreciated, and we note that no specific major comments requiring point-by-point clarification were raised.

Circularity Check

0 steps flagged

No significant circularity detected


This is a position paper with no mathematical derivation, fitted parameters, or predictive claims. The central argument—that prior validity varies across deployments and training, so no single offline-to-online management strategy is universally optimal—rests on a descriptive framework of three functional roles, controlled experiments showing help-or-hurt reversals, cross-domain evidence, and direct engagement with counterarguments. No load-bearing step reduces by construction to self-citation, ansatz, or input data; the logical step to diagnosis-driven management follows from the stated variation premise without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central position rests on the domain assumption of variable prior validity; no free parameters, invented entities, or additional axioms are extractable from the abstract.

axioms (1)
  • domain assumption: Prior validity varies across deployments and shifts during training.
    This underpins the claim that no universal approach is optimal.

pith-pipeline@v0.9.1-grok · 5732 in / 1248 out tokens · 43337 ms · 2026-06-25T21:13:27.282946+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 14 linked inside Pith

  1. [1]

    Gradient regularization prevents reward hacking in reinforcement learning from human feedback and verifiable rewards.arXiv preprint arXiv:2602.18037, 2026

    Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, and Masashi Sugiyama. Gradient regularization prevents reward hacking in reinforcement learning from human feedback and verifiable rewards.arXiv preprint arXiv:2602.18037, 2026. 8, 11

  2. [2]

    Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopad- hyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 4, 11

  3. [3]

    WIMLE: Uncertainty-aware world models with IMLE for sample-efficient continuous control

    Mehran Aghabozorgi, Alireza Moazeni, Yanshu Zhang, and Ke Li. WIMLE: Uncertainty-aware world models with IMLE for sample-efficient continuous control. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=mzLOnTb3WH. 12

  4. [4]

    What matters for simulation to online reinforcement learning on real robots

    Yarden As, Dhruva Tirumala, René Zurbrügg, Chenhao Li, Stelian Coros, Andreas Krause, and Markus Wulfmeier. What matters for simulation to online reinforcement learning on real robots. arXiv preprint arXiv:2602.20220, 2026. 2, 4, 8, 11

  5. [5]

    Expert or not? assessing data quality in offline reinforcement learning.arXiv preprint arXiv:2510.12638, 2025

    Arip Asadulaev, Fakhri Karray, and Martin Takac. Expert or not? assessing data quality in offline reinforcement learning.arXiv preprint arXiv:2510.12638, 2025. 10, 12, 13

  6. [6]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023. 2, 4, 5, 8, 23

  7. [7]

    Rethinking rl evaluation: Can benchmarks truly reveal failures of rl methods?arXiv preprint arXiv:2510.10541, 2025

    Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, and Cho-Jui Hsieh. Rethinking rl evaluation: Can benchmarks truly reveal failures of rl methods?arXiv preprint arXiv:2510.10541, 2025. 8, 12

  8. [8]

    Annealing bridges offline and online RL,

    Geonwoo Cho, Jaegyun Im, Doyoon Kim, and Lexin Li. Annealing bridges offline and online RL,

  9. [9]

    URLhttps://openreview.net/forum?id=umVAbmKf1L. 10

  10. [10]

    Loss of plasticity in deep continual learning.Nature, 632(8026): 768–774, 2024

    Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mah- mood, and Richard S Sutton. Loss of plasticity in deep continual learning.Nature, 632(8026): 768–774, 2024. 5

  11. [11]

    First return, then explore.Nature, 590(7847):580–586, 2021

    Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. First return, then explore.Nature, 590(7847):580–586, 2021. 5

  12. [12]

    D4rl: Datasets for deep data-driven reinforcement learning, 2020

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2020. 22

  13. [13]

    A minimalist approach to offline reinforcement learn- ing

    Scott Fujimoto and Shixiang (Shane) Gu. A minimalist approach to offline reinforcement learn- ing. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, 14 Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors Advances in Neural Information Processing Systems, volume 34, pages 20132–...

  14. [14]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023. 2, 5, 8

  15. [15]

    Offline rl policies should be trained to be adaptive

    Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, and Sergey Levine. Offline rl policies should be trained to be adaptive. InInternational Conference on Machine Learning, pages 7513–7530. PMLR, 2022. 6, 11

  16. [16]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 4

  17. [17]

    Improving vision-language-action model with online reinforcement learning

    Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15665–15672. IEEE, 2025. 4

  18. [18]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018. 7, 24

  19. [19]

    Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023. 2, 4, 5

  20. [20]

    FIRE: Frobenius-isometry reinitialization for balancing the stability–plasticity tradeoff

    Isaac Han, Sangyeon Park, Seungwon Oh, Donghu Kim, Hojoon Lee, and KyungJoong Kim. FIRE: Frobenius-isometry reinitialization for balancing the stability–plasticity tradeoff. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/ forum?id=CfZLxT3zIZ. 13

  21. [21]

    TD-MPC2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Oxh5CstDJU. 4

  22. [22]

    Exploration in deep reinforcement learning: From single-agent to multiagent domain

    Jianye Hao, Tianpei Yang, Hongyao Tang, Chenjia Bai, Jinyi Liu, Zhaopeng Meng, Peng Liu, and Zhen Wang. Exploration in deep reinforcement learning: From single-agent to multiagent domain. IEEE transactions on neural networks and learning systems, 35(7):8762–8782, 2023. 5

  23. [23]

    World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080, 2026

    Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, and Jianfei Yang. World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080, 2026. 4 15 Beyond One-Size-Fits-All: Diagn...

  24. [24]

    Bayesiandesignprinciplesforoffline-to-onlinereinforcement learning

    Hao Hu, Yiqin Yang, Jianing Ye, Chengjie Wu, Ziqing Mai, Yujing Hu, Tangjie Lv, Changjie Fan, QianchuanZhao, andChongjieZhang. Bayesiandesignprinciplesforoffline-to-onlinereinforcement learning. InForty-first International Conference on Machine Learning, 2024. URL https:// openreview.net/forum?id=HLHQxMydFk. 6, 11

  25. [25]

    Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning.arXiv preprint arXiv:2603.11653, 2026

    Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, and Roberto Martin- Martin. Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning.arXiv preprint arXiv:2603.11653, 2026. 8, 11

  26. [26]

    Emma Jordan, Adam White, Bruno Castro da Silva, Martha White, and Philip S. Thomas. Position: Benchmarking is limited in reinforcement learning research. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=Xe7n2ZqpBP. 8, 12

  27. [27]

    Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024. 11

  28. [28]

    Adaptive capacity allocation for vision language action fine-tuning.arXiv preprint arXiv:2603.07404, 2026

    Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, and By- onghyo Shim. Adaptive capacity allocation for vision language action fine-tuning.arXiv preprint arXiv:2603.07404, 2026. 11

  29. [29]

    OpenVLA: An open- source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open- source vision-language-action model. In8th Annual Conference on Robot Learni...

  30. [30]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13): 3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13): 3521–3526, 2017. 11

  31. [31]

    Plasticity loss in deep reinforcement learning: A survey.arXiv preprint arXiv:2411.04832,

    Timo Klein, Christoph Luther, Manus McAuliffe, Lukas Miklautz, Claudia Plant, and Sebastian Tschi- atschek. Plasticity loss in deep reinforcement learning: A survey.arXiv preprint arXiv:2411.04832,

  32. [32]

    Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179–1191, 2020

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179–1191, 2020. 4

  33. [33]

    Exploration in deep reinforcement learning: A survey.Information Fusion, 85:1–22, 2022

    Pawel Ladosz, Lilian Weng, Minwoo Kim, and Hyondong Oh. Exploration in deep reinforcement learning: A survey.Information Fusion, 85:1–22, 2022. 5

  34. [34]

    Simulation distillation: Pretraining world models in simulation for rapid real-world adaptation.arXiv preprint arXiv:2603.15759, 2026

    Jacob Levy, Tyler Westenbroek, Kevin Huang, Fernando Palafox, Patrick Yin, Shayegan Omidshafiei, Dong-Ki Kim, Abhishek Gupta, and David Fridovich-Keil. Simulation distillation: Pretraining world models in simulation for rapid real-world adaptation.arXiv preprint arXiv:2603.15759, 2026. 2, 8 16 Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcemen...

  35. [35]

    Ramesh, Edan Meyer, Dale Schuurmans, and Marlos C

    Alex Lewandowski, Aditya A. Ramesh, Edan Meyer, Dale Schuurmans, and Marlos C. Machado. The world is bigger! a computationally-embedded perspective on the big world hypothesis. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps: //openreview.net/forum?id=gJclyLFSdU. 2

  36. [36]

    The three regimes of offline-to-online rein- forcement learning.arXiv preprint arXiv:2510.01460, 2025

    Lu Li, Tianwei Ni, Yihao Sun, and Pierre-Luc Bacon. The three regimes of offline-to-online rein- forcement learning.arXiv preprint arXiv:2510.01460, 2025. 2, 5, 8, 10, 11, 12, 13

  37. [37]

    What matters in building vision–language–action models for generalist robots.Nature Machine Intelligence, pages 1–15, 2026

    Xinghang Li, Peiyan Li, Long Qian, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Xinlong Wang, Di Guo, et al. What matters in building vision–language–action models for generalist robots.Nature Machine Intelligence, pages 1–15, 2026. 11

  38. [38]

    Pretrained vision-language- action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026

    Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, and Yuke Zhu. Pretrained vision-language- action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026. 11

  39. [39]

    ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=YPsJha5HXQ. 10

  40. [40]

    Imitation is not enough: Robustify- ing imitation with reinforcement learning for challenging driving scenarios

    Yiren Lu, Justin Fu, George Tucker, Xinlei Pan, Eli Bronstein, Rebecca Roelofs, Benjamin Sapp, Brandyn White, Aleksandra Faust, Shimon Whiteson, et al. Imitation is not enough: Robustify- ing imitation with reinforcement learning for challenging driving scenarios. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 755...

  41. [41]

    Serl: A software suite for sample-efficient robotic reinforcement learning

    Jianlan Luo, Zheyuan Hu, Charles Xu, You Liang Tan, Jacob Berg, Archit Sharma, Stefan Schaal, Chelsea Finn, Abhishek Gupta, and Sergey Levine. Serl: A software suite for sample-efficient robotic reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16961–16969. IEEE, 2024. 2

  42. [42]

    Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions.arXiv preprint arXiv:2303.17396, 2023

    Yicheng Luo, Jackie Kay, Edward Grefenstette, and Marc Peter Deisenroth. Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions.arXiv preprint arXiv:2303.17396, 2023. 5

  43. [43]

    Understanding plasticity in neural networks

    Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. InInternational Conference on Machine Learning, pages 23190–23211. PMLR, 2023. 5, 12

  44. [44]

    Disentangling the causes of plasticity loss in neural networks.arXiv preprint arXiv:2402.18762, 2024

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado Van Hasselt, Razvan Pascanu, James Martens, and Will Dabney. Disentangling the causes of plasticity loss in neural networks.arXiv preprint arXiv:2402.18762, 2024. 5

  45. [45]

    Guozheng Ma, Lu Li, Sen Zhang, Zixuan Liu, Zhen Wang, Yixin Chen, Li Shen, Xueqian Wang, and Dacheng Tao. Revisiting plasticity in visual reinforcement learning: Data, modules and 17 Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors training stages. InThe Twelfth International Conference on Learning Representatio...

  46. [46]

    What makes value learning efficient in residual reinforcement learning?arXiv preprint arXiv:2602.10539, 2026

    Guozheng Ma, Lu Li, Haoyu Wang, Zixuan Liu, Pierre-Luc Bacon, and Dacheng Tao. What makes value learning efficient in residual reinforcement learning?arXiv preprint arXiv:2602.10539, 2026. 13

  47. [47]

    Position: Lifetime tuning is incompatible with continual reinforcement learning

    Golnaz Mesbahi, Parham Mohammad Panahi, Olya Mastikhina, Steven Tang, Martha White, and Adam White. Position: Lifetime tuning is incompatible with continual reinforcement learning. InForty-second International Conference on Machine Learning Position Paper Track, 2025. URL https://openreview.net/forum?id=JMoWFkwnvv. 8, 12

  48. [48]

    Information- theoretic reward modeling for stable rlhf: Detecting and mitigating reward hacking.arXiv preprint arXiv:2510.13694, 2025

    Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, and Dacheng Tao. Information- theoretic reward modeling for stable rlhf: Detecting and mitigating reward hacking.arXiv preprint arXiv:2510.13694, 2025. 10, 12

  49. [49]

    Awac: Accelerating online reinforcement learning with offline datasets.arXiv preprint arXiv:2006.09359, 2020

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets.arXiv preprint arXiv:2006.09359, 2020. 4, 5

  50. [50]

    Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning

    Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269, 2023. 4, 5, 7, 24

  51. [51]

    Long-horizon model-based offline reinforcement learning without conservatism.arXiv preprint arXiv:2512.04341, 2025

    Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, and Pierre-Luc Bacon. Long-horizon model-based offline reinforcement learning without conservatism.arXiv preprint arXiv:2512.04341, 2025. 8, 11

  52. [52]

    From static policies to adaptive priors in offline reinforcement learning.Preprint, 2026

    Tianwei Ni, Vineet Jain, Akash Karthikeyan, and Pierre-Luc Bacon. From static policies to adaptive priors in offline reinforcement learning.Preprint, 2026. URLhttps://twni2016.github.io/ papers/2026position_paper.pdf. 6, 11

  53. [53]

    The primacy bias in deep reinforcement learning

    Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. InInternational conference on machine learning, pages 16828–16847. PMLR, 2022. 5

  54. [54]

    Deep reinforcement learning with plasticity injection

    Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and Andre Barreto. Deep reinforcement learning with plasticity injection. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id= jucDLW6G9l. 5

  55. [55]

    Simplicial embeddings improve sample efficiency in actor–critic agents

    Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, and Pablo Samuel Castro. Simplicial embeddings improve sample efficiency in actor–critic agents. InThe Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=mCpq1GCKxA. 13 18 Beyond One-Size-Fits-All: Diagnosis-Driven Onl...

  56. [56]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730– 27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730– 27744, 2022. 2, 4, 5, 8

  57. [57]

    A survey on transfer learning.IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009

    Sinno Jialin Pan and Qiang Yang. A survey on transfer learning.IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009. 11

  58. [58]

    Think dense, not long: Dynamic decoupled conditional advantage for efficient reasoning

    Keqin Peng, Yuanxin Ouyang, Xuebo Liu, Zhiliang Tian, Ruijian Han, Yancheng Yuan, and Liang Ding. Think dense, not long: Dynamic decoupled conditional advantage for efficient reasoning. arXiv preprint arXiv:2602.02099, 2026. 8

  59. [59]

    Sim-to-real transfer of robotic control with dynamics randomization

    Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In2018 IEEE international conference on robotics and automation (ICRA), pages 3803–3810. IEEE, 2018. 2, 4

  60. [60]

    Videodex: Learning dexterity from internet videos

    Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In6th Annual Conference on Robot Learning, 2022. URLhttps://openreview.net/ forum?id=qUhkhHw8Dz. 4

  61. [61]

    Welcome to the era of experience.Google AI, 1:11, 2025

    David Silver and Richard S Sutton. Welcome to the era of experience.Google AI, 1:11, 2025. 2

  62. [62]

    The dormant neuron phe- nomenon in deep reinforcement learning

    Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phe- nomenon in deep reinforcement learning. InInternational Conference on Machine Learning, pages 32145–32168. PMLR, 2023. 5, 10, 12, 13

  63. [63]

    Adaptive replay buffer for offline-to-online rein- forcement learning

    Chihyeon Song, Jaewoo Lee, and Jinkyoo Park. Adaptive replay buffer for offline-to-online rein- forcement learning. InThe 29th International Conference on Artificial Intelligence and Statistics,

  64. [64]

    URLhttps://openreview.net/forum?id=NgmNlIBiBz. 10

  65. [65]

    Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning.arXiv preprint arXiv:2509.25300,

    Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xi- angyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning.arXiv preprint arXiv:2509.25300,

  66. [66]

    Position: Ignoring hyperparameter tuning costs misleads the development of efficient rl algorithms.preprint, 2025

    Ziqi Tang and Xuezhou Zhang. Position: Ignoring hyperparameter tuning costs misleads the development of efficient rl algorithms.preprint, 2025. 8, 12

  67. [67]

    Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099, 2026

    GigaBrain Team, Boyuan Wang, Bohan Li, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099, 2026. 4, 11

  68. [68]

    Overcoming the sim-to-real gap: Leveraging simulation to learn to explore for real-world RL

    Andrew Wagenmaker, Kevin Huang, Liyiming Ke, Kevin Jamieson, and Abhishek Gupta. Overcoming the sim-to-real gap: Leveraging simulation to learn to explore for real-world RL. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview. net/forum?id=JjQl8hXJAS. 4 19 Beyond One-Size-Fits-All: Diagnosis-Driven O...

  69. [69]

    BridgeData V2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023. 2, 4, 11

  70. [70]

    Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. Advances in Neural Information Processing Systems, 36:47081–47104, 2023

    Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Lin, Hao Chen, Liwei Wu, Ning Jia, Shiji Song, and Gao Huang. Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. Advances in Neural Information Processing Systems, 36:47081–47104, 2023. 2

  71. [71]

    Fine-tuning reinforcement learning models is secretly a forgetting mitigation problem

    Maciej Wolczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, and Piotr Miłoś. Fine-tuning reinforcement learning models is secretly a forgetting mitigation problem. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=53iSXb1m8w. 5

  72. [72]

    Behavior regularized offline reinforcement learning

    Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019. 4

  73. [73]

    DrM: Mastering visual reinforcement learning through dormant ratio minimization

    Guowei Xu, Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Zhecheng Yuan, Tianying Ji, Yu Luo, Xiaoyu Liu, Jiaxin Yuan, Pu Hua, Shuzhen Li, Yanjie Ze, Hal Daumé III, Furong Huang, and Huazhe Xu. DrM: Mastering visual reinforcement learning through dormant ratio minimization. In The Twelfth International Conference on Learning Representations, 2024. URL https://op...

  74. [74]

    World action models are zero-shot policies, 2026

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

  75. [75]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  76. [76]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5. 5

  77. [77]

    A survey on negative transfer. IEEE/CAA Journal of Automatica Sinica, 10(2):305–329, 2022

    Wen Zhang, Lingfei Deng, Lei Zhang, and Dongrui Wu. A survey on negative transfer. IEEE/CAA Journal of Automatica Sinica, 10(2):305–329, 2022. 6

  78. [78]

    Echo chamber: RL post-training amplifies behaviors learned in pretraining

    Rosie Zhao, Alexandru Meterez, Sham M. Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: RL post-training amplifies behaviors learned in pretraining. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=dp4KWuSDzj. 5, 6

  79. [79]

    Efficient online reinforcement learning fine-tuning need not retain offline data

    Zhiyuan Zhou, Andy Peng, Qiyang Li, Sergey Levine, and Aviral Kumar. Efficient online reinforcement learning fine-tuning need not retain offline data. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=HN0CYZbAPw. 2, 4, 5, 8, 10

  80. [80]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 4