AdamO: A Collapse-Suppressed Optimizer for Offline RL
Pith reviewed 2026-05-10 14:44 UTC · model grok-4.3
The pith
AdamO modifies Adam with a regulated orthogonality correction to stabilize offline RL critics against collapse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that offline TD critic updates remain stable if and only if the spectral radius of the induced update operator is strictly less than one. Standard Adam can push this radius above one by distorting parameter geometry, thereby amplifying TD errors. AdamO adds an explicit orthogonality correction that is decoupled from the main update and bounded by a task-alignment budget; within the modeled regime this correction forces the radius below one, guaranteeing task safety without destroying the dissipative structure of Adam's continuous-time dynamics.
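Stated compactly, in our notation rather than the paper's (the symbols e_t for the local critic-parameter error and A for the induced update operator are assumptions):

```latex
% Hedged formalization; e_t and A are our symbols, not the paper's.
e_{t+1} = A\, e_t, \qquad
\rho(A) = \max_i \lvert \lambda_i(A) \rvert, \qquad
\text{locally stable} \;\Longleftrightarrow\; \rho(A) < 1 .
```

On this reading, AdamO's correction is a mechanism for holding the radius below one without altering the continuous-time limit of the dynamics.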
What carries the argument
AdamO, an Adam-based optimizer that augments each step with a decoupled orthogonality correction whose magnitude is strictly limited by a task-alignment budget, thereby restoring the spectral-radius condition for stability.
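To make the mechanism concrete, here is a minimal sketch of what a decoupled, budget-capped orthogonality correction could look like on top of Adam for a flat parameter vector. The correction direction (the component of the step orthogonal to the raw gradient), the budget semantics, and all names (adamo_step, orth_budget) are assumptions reconstructed from the abstract, not the paper's definitions.

```python
import numpy as np

def init_adamo_state(theta):
    """Adam moment buffers for a flat parameter vector."""
    return {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}

def adamo_step(theta, grad, state, lr=3e-4, betas=(0.9, 0.999),
               eps=1e-8, orth_budget=0.1):
    """One AdamO-style step: a standard Adam update followed by a
    decoupled orthogonality correction whose norm is capped by a
    strict budget. A sketch from the abstract's description; the
    paper's actual correction direction may differ."""
    b1, b2 = betas
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)    # standard Adam step

    # Decoupled orthogonality correction: damp the component of the
    # update that is orthogonal to the raw gradient, but never by more
    # than a fixed fraction of the step's norm (the alignment budget).
    g_unit = grad / (np.linalg.norm(grad) + eps)
    orth = update - np.dot(update, g_unit) * g_unit  # off-gradient part
    cap = orth_budget * np.linalg.norm(update)       # budget in step units
    scale = min(1.0, cap / (np.linalg.norm(orth) + eps))
    correction = -scale * orth                       # bounded correction

    return theta + update + correction, state
```

A faithful implementation would presumably define the correction against whatever task-alignment subspace the paper specifies; the raw gradient stands in for it here, which is the loudest assumption in the sketch.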
If this is right
- Any offline RL algorithm that swaps its optimizer for AdamO inherits the local stability guarantee and improved returns.
- The same orthogonality correction can be added to other first-order methods without changing their continuous-time limits.
- Worst-case task safety holds as long as the task-alignment budget is respected during training.
Where Pith is reading between the lines
- The feedback-system view of TD updates may be reusable for diagnosing instability in online RL or model-based methods.
- Enforcing orthogonality only on the critic parameters could be tested as a lighter-weight alternative to full AdamO.
- If the spectral condition is violated in practice, simply rescaling the learning rate might restore stability without the extra orthogonality term.
Load-bearing premise
The stability proof and spectral-radius condition apply only inside the specific regime where local update dynamics can be represented as a linear feedback system.
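The premise is easiest to inspect in classical linear TD(0), where the expected update is affine and the local operator has a closed form. This is standard linear-TD algebra, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_feats, gamma, alpha = 20, 5, 0.99, 0.1

phi = rng.normal(size=(n_states, n_feats))            # feature matrix
P = rng.dirichlet(np.ones(n_states), size=n_states)   # row-stochastic transitions
d = np.ones(n_states) / n_states                      # off-policy state weights
D = np.diag(d)

# Expected TD(0) update: theta <- (I - alpha * M) theta + const, with
M = phi.T @ D @ (phi - gamma * P @ phi)
A = np.eye(n_feats) - alpha * M                       # local update operator

rho = max(abs(np.linalg.eigvals(A)))
print(f"spectral radius = {rho:.3f} -> {'stable' if rho < 1 else 'unstable'}")
```

When the state weights d do not match the transition dynamics (the off-policy case), M can acquire eigenvalues with negative real part, and no learning rate brings the radius below one — the same failure mode that would limit the learning-rate-rescaling idea floated above.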
What would settle it
An offline RL run using AdamO in which the critic still diverges to unusable Q-values while the spectral radius of the update operator remains below one would falsify the safety claim.
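Running such a test presupposes a way to measure the spectral radius during training. One standard diagnostic is power iteration on finite-difference Jacobian-vector products of the update map; the update_map callable below is hypothetical, standing in for one optimizer step on a fixed offline batch, and the estimate is only reliable when the dominant eigenvalue is real and simple.

```python
import numpy as np

def estimate_spectral_radius(update_map, theta, n_iters=50, fd_eps=1e-6):
    """Power-iteration estimate of the spectral radius of the local
    update operator J = d(update_map)/d(theta), using finite-difference
    Jacobian-vector products. `update_map` is hypothetical: it applies
    one optimizer step on a fixed offline batch and returns the new
    parameters."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    base = update_map(theta)
    rho = 0.0
    for _ in range(n_iters):
        jv = (update_map(theta + fd_eps * v) - base) / fd_eps  # J @ v
        rho = np.linalg.norm(jv)          # converges to |lambda_max|
        v = jv / (rho + 1e-12)
    return rho
```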
Original abstract
Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability. From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one. Further analysis suggests that standard Adam updates can inadvertently distort the parameter geometry, motivating explicit orthogonality constraints to prevent TD error amplification. To this end, we propose AdamO, an Adam-based optimizer with a decoupled orthogonality correction regulated by a strict task-alignment budget. We prove that this design theoretically guarantees worst-case task safety and preserves Adam's continuous-time dissipative dynamics. Empirically, AdamO is broadly compatible with diverse offline RL baselines, improving stability and returns across a broad suite of benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdamO, an Adam-based optimizer augmented with a decoupled orthogonality correction controlled by a task-alignment budget, for stabilizing critic updates in offline RL. It models offline TD learning as a feedback system, derives a necessary-and-sufficient spectral-radius condition for local stability of the update dynamics, and claims a proof that the design guarantees worst-case task safety while preserving Adam's continuous-time dissipative properties. Empirical results indicate improved stability and returns when plugged into diverse offline RL baselines across benchmarks.
Significance. If the local-to-global bridging argument and the worst-case safety proof hold beyond the modeled regime, the contribution would be notable: it reframes optimizer choice itself as a mechanism for suppressing TD collapse rather than relying solely on algorithmic or architectural fixes, and the control-theoretic lens could generalize to other bootstrapped settings. The compatibility with existing baselines is a practical strength.
major comments (2)
- [theoretical analysis / feedback-system derivation] The necessary-and-sufficient spectral-radius < 1 condition is derived only for the local update dynamics when modeled as a feedback system (theoretical analysis section). No explicit argument shows why this local operator property prevents TD-error amplification over the full non-stationary offline trajectory, across the state-action distribution, or when bootstrapping pushes the system outside the modeled regime; yet the abstract and introduction assert a proof of worst-case task safety.
- [AdamO design / task-alignment budget] The task-alignment budget is introduced to regulate the orthogonality correction and avoid geometry distortion, but the manuscript does not demonstrate that its selection is independent of the stability objective; this creates a risk that the budget is implicitly tuned to the same spectral-radius condition it is meant to enforce (see the definition of the correction term and the budget parameter).
minor comments (2)
- [experiments] The experimental section would benefit from an ablation isolating the contribution of the orthogonality correction versus the budget alone, and from reporting the fraction of runs that still exhibit collapse under AdamO.
- [preliminaries / notation] Notation for the update operator and its spectral radius should be introduced earlier and used consistently; the transition from continuous-time dissipative dynamics to the discrete feedback model is not clearly signposted.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments raise important points about the scope of our theoretical guarantees and the independence of design parameters. We address each major comment below and outline revisions that will strengthen the manuscript.
Point-by-point responses
Referee: The necessary-and-sufficient spectral-radius < 1 condition is derived only for the local update dynamics when modeled as a feedback system (theoretical analysis section). No explicit argument shows why this local operator property prevents TD-error amplification over the full non-stationary offline trajectory, across the state-action distribution, or when bootstrapping pushes the system outside the modeled regime; yet the abstract and introduction assert a proof of worst-case task safety.
Authors: We appreciate the referee highlighting the need for clearer bridging between local and global regimes. Our feedback-system model is constructed precisely to represent the core TD update operator under the fixed offline data distribution; the necessary-and-sufficient spectral-radius condition establishes contractivity of the linearized dynamics, which precludes local error amplification. In the offline setting the state-action distribution is stationary by construction, and the worst-case safety claim follows from the fact that the orthogonality correction enforces the radius bound uniformly. Nevertheless, we acknowledge that an explicit connection from the local linearization to the full trajectory (including potential distribution shifts induced by bootstrapping) is not spelled out in sufficient detail. In the revised manuscript we will add a dedicated subsection that invokes a discrete-time Lyapunov argument to show that a local spectral radius strictly below one implies bounded error propagation over finite-length trajectories under the bounded non-stationarity present in offline RL. revision: yes
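For readers who want the shape of that promised argument, the standard discrete-time Lyapunov route runs as follows; this is our reconstruction, not the paper's proof:

```latex
% Our reconstruction of the bridging argument, not the paper's proof.
% Step 1: a spectral radius below one yields a quadratic Lyapunov function:
\rho(A) < 1 \;\Longleftrightarrow\; \exists\, P \succ 0 :\quad
A^{\top} P A - P = -Q, \quad Q \succ 0 .
% Step 2: V(e) = e^{\top} P e then decays geometrically along e_{t+1} = A e_t.
% Step 3: under bounded perturbations (the "bounded non-stationarity"),
%   e_{t+1} = A e_t + w_t  with  \|w_t\| \le \varepsilon,
% the geometric decay absorbs the disturbance:
\limsup_{t \to \infty} \|e_t\| \;\le\; C(P, Q)\, \varepsilon .
```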
Referee: The task-alignment budget is introduced to regulate the orthogonality correction and avoid geometry distortion, but the manuscript does not demonstrate that its selection is independent of the stability objective; this creates a risk that the budget is implicitly tuned to the same spectral-radius condition it is meant to enforce (see the definition of the correction term and the budget parameter).
Authors: The task-alignment budget is introduced to limit the magnitude of the orthogonality correction so that the update direction remains aligned with the original Adam gradient, thereby preserving the continuous-time dissipative properties we analyze. The stability proof shows that the spectral radius is strictly less than one for every budget value in (0,1], independent of the particular numerical choice; the budget therefore does not need to be tuned to the radius condition itself. Its role is geometric rather than stability-specific. To remove any ambiguity we will insert a clarifying paragraph in the AdamO design section that states this independence explicitly and will augment the experimental section with a sensitivity plot demonstrating that performance remains stable across a wide interval of budget values without requiring re-tuning for the spectral-radius guarantee. revision: partial
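The independence claim is easier to audit once the budget's geometric role is written out. A hedged formalization (u, c, and beta are our symbols for the Adam update, the correction, and the budget):

```latex
% Hedged formalization; u = Adam update, c = decoupled orthogonality
% correction, beta in (0, 1] = task-alignment budget.
\langle c,\, u \rangle = 0, \quad \|c\| \le \beta \|u\|
\;\;\Longrightarrow\;\;
\cos \angle(u + c,\; u) \;=\; \frac{\|u\|}{\sqrt{\|u\|^{2} + \|c\|^{2}}}
\;\ge\; \frac{1}{\sqrt{1 + \beta^{2}}} \;>\; 0 ,
% so the corrected step stays in the half-space of the Adam update for
% every beta, consistent with the claimed budget-independence.
```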
Circularity Check
No circularity in derivation chain
Full rationale
The paper models offline TD updates as a feedback system, derives a necessary-and-sufficient spectral-radius condition for local stability directly from that model, identifies geometry distortion in standard Adam, introduces a decoupled orthogonality correction with task-alignment budget, and claims a proof that the resulting AdamO design guarantees worst-case task safety while preserving dissipative dynamics. Each step is an analytical derivation or design choice motivated by the preceding analysis rather than a definitional equivalence, fitted input renamed as prediction, or load-bearing self-citation. The local-to-global safety claim is presented as a substantive theorem rather than a reduction to the input model by construction, and no equations or steps are shown to collapse into their own premises.
Axiom & Free-Parameter Ledger
free parameters (1)
- task-alignment budget
axioms (2)
- domain assumption: Offline TD learning can be represented as a feedback system whose local dynamics are captured by an update operator
- domain assumption: Standard Adam updates can distort parameter geometry in a manner that amplifies TD errors