
arxiv: 2505.13196 · v2 · submitted 2025-05-19 · 💻 cs.LG · cs.AI · quant-ph

A Physics-Inspired Optimizer: Velocity Regularized Adam

Pith reviewed 2026-05-22 14:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · quant-ph
keywords velocity regularized adam · adam optimizer · edge of stability · convergence bounds · deep neural networks · physics-inspired optimization · non-convex stochastic optimization

The pith

Velocity-Regularized Adam damps oscillations by penalizing high-velocity updates and outperforms AdamW on standard tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Velocity-Regularized Adam, or VRAdam, which adds a higher-order penalty drawn from quartic kinetic energy terms to the Adam optimizer. This penalty automatically shrinks the effective learning rate whenever parameter updates grow large, reducing the rapid oscillations that arise when Adam operates at the edge of stability. The resulting hybrid combines global velocity damping with Adam's per-parameter adaptation. If the method works as described, it yields both stronger empirical performance across image classification, language modeling, and generative modeling and a convergence guarantee of order O(ln(N)/sqrt(N)) for stochastic non-convex problems under mild assumptions.

Core claim

VRAdam adds a velocity-based higher-order penalty to the Adam update rule so that the algorithm automatically slows down in regimes of large weight changes. The penalty is motivated by the stabilizing role of quartic terms in physical kinetic energy and is analyzed from both physical and control-theoretic viewpoints on momentum dynamics. Under mild assumptions the method delivers a convergence rate of O(ln(N)/sqrt(N)) for stochastic non-convex objectives while, in practice, exceeding the performance of AdamW on CNN image classification, Transformer language modeling, and GFlowNet generative tasks.

What carries the argument

The velocity regularizer: a higher-order penalty that divides the learning rate by a factor growing with the squared velocity of parameter updates, so the effective step shrinks as updates become large. It supplies global damping while preserving Adam's per-parameter scaling.
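
As a minimal sketch of how such a velocity gate might sit on top of Adam, the Python snippet below scales a base step size `alpha0` by `1 / (1 + min(beta3 * ||v||^2, alpha1))`. That form is one reading of the appendix excerpt reproduced at the end of this page: the gate from Eq. (46) combined with the cap visible in Eq. (45). Everything else is an assumption made for illustration: the names `vradam_step` and `init_state`, the use of the uncorrected first moment as the velocity `v`, the default values, and the omission of weight decay. It is not the authors' reference implementation.

```python
import numpy as np

def init_state(x):
    """Fresh optimizer state (hypothetical helper)."""
    return {"t": 0, "v": np.zeros_like(x), "s": np.zeros_like(x)}

def vradam_step(x, grad, state, alpha0=1e-3, beta1=0.9, beta2=0.999,
                beta3=1.0, alpha1=19.0, eps=1e-8):
    """One velocity-gated Adam-style update (illustrative sketch only)."""
    state["t"] += 1
    t = state["t"]
    # Adam moments: v is the momentum ("velocity"), s the second-moment estimate.
    state["v"] = beta1 * state["v"] + (1.0 - beta1) * grad
    state["s"] = beta2 * state["s"] + (1.0 - beta2) * grad ** 2
    v_hat = state["v"] / (1.0 - beta1 ** t)
    s_hat = state["s"] / (1.0 - beta2 ** t)
    # Global gate: a single scalar for all parameters, shrinking as ||v||^2 grows.
    v_sq = float(np.sum(state["v"] ** 2))
    eta = alpha0 / (1.0 + min(beta3 * v_sq, alpha1))
    # Per-parameter scaling as in Adam, applied with the gated step size.
    x_new = x - eta * v_hat / (np.sqrt(s_hat) + eps)
    return x_new, eta

x = np.ones(3)
_, eta_small = vradam_step(x, 0.01 * np.ones(3), init_state(x))
_, eta_large = vradam_step(x, 100.0 * np.ones(3), init_state(x))
print("effective step, small gradient:", eta_small)
print("effective step, large gradient:", eta_large)
```

In this sketch, setting `beta3=0` makes the gate equal to one, so the update reduces to plain Adam with step size `alpha0`; the cap `alpha1` keeps the gated step from falling below `alpha0 / (1 + alpha1)`.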

If this is right

  • VRAdam exceeds AdamW performance on image classification with CNNs, language modeling with Transformers, and generative modeling with GFlowNets.
  • The effective learning rate shrinks automatically in high-velocity regimes, damping oscillations at the edge of stability.
  • Convergence bounds of O(ln(N)/sqrt(N)) hold for stochastic non-convex objectives under the paper's mild assumptions.
  • The optimizer combines Adam-style per-parameter scaling with a single global velocity-based damping mechanism.
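
To give a feel for how the stated rate behaves, the short Python sketch below evaluates only the N-dependence `ln(N)/sqrt(N)` of the bound. The constant factor and the exact quantity being bounded are not given on this page, so the function `rate_envelope` is a hypothetical helper that drops them; the numbers it prints are not the paper's bound itself.

```python
import math

def rate_envelope(n):
    """ln(N)/sqrt(N): the N-dependence of the stated bound, constants dropped."""
    return math.log(n) / math.sqrt(n)

for n in (10**2, 10**4, 10**6):
    print(f"N={n:>8d}  ln(N)/sqrt(N)={rate_envelope(n):.4f}")
```

The expression decreases monotonically once N exceeds e^2 (about 7.4), so the logarithm only slows the familiar 1/sqrt(N) decay rather than changing its direction.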

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar velocity penalties could be attached to other adaptive first-order methods to obtain comparable stabilization.
  • The same control perspective might be used to design stabilizers for optimization in reinforcement learning or physics-informed neural networks.
  • The approach suggests that explicit penalties on update speed can reduce reliance on manual learning-rate schedules.

Load-bearing premise

The velocity penalty can be inserted into Adam without creating fresh instabilities and the mild assumptions used in the convergence proof continue to hold for the deep-network objectives and architectures tested.

What would settle it

A controlled run on one of the reported benchmarks in which VRAdam either diverges or records lower final accuracy than AdamW, or a calculation showing that the stated O(ln(N)/sqrt(N)) bound is violated once the velocity term is active.
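
A controlled comparison of this kind could be organized along the lines of the Python sketch below. It is a toy stand-in, not one of the reported benchmarks: the noisy two-dimensional quadratic, the function name `run`, and all hyperparameter values are assumptions chosen for illustration, and the gate form is the same reading of Eqs. (45) and (46) used elsewhere on this page. The only point is the protocol: identical seed, identical noise, identical base step size, with the velocity penalty `beta3` as the single thing that changes.

```python
import numpy as np

def run(beta3, steps=200, alpha0=0.05, beta1=0.9, beta2=0.999,
        alpha1=19.0, eps=1e-8, seed=0):
    """Minimise a noisy ill-conditioned quadratic; beta3=0 recovers plain Adam."""
    rng = np.random.default_rng(seed)
    h = np.array([100.0, 1.0])      # Hessian diagonal of the toy problem
    x = np.array([1.0, 1.0])
    v = np.zeros(2)
    s = np.zeros(2)
    losses = []
    for t in range(1, steps + 1):
        grad = h * x + 0.01 * rng.standard_normal(2)   # same noise for every beta3
        v = beta1 * v + (1 - beta1) * grad
        s = beta2 * s + (1 - beta2) * grad ** 2
        # Velocity gate on the global step size (assumed form).
        eta = alpha0 / (1.0 + min(beta3 * float(v @ v), alpha1))
        x = x - eta * (v / (1 - beta1 ** t)) / (np.sqrt(s / (1 - beta2 ** t)) + eps)
        losses.append(0.5 * float(np.sum(h * x ** 2)))
    return np.array(losses)

adam = run(beta3=0.0)
gated = run(beta3=1.0)
print("final loss  adam:", adam[-1], " gated:", gated[-1])
print("diverged?   adam:", not np.isfinite(adam).all(),
      " gated:", not np.isfinite(gated).all())
```

A result on a toy problem like this says nothing about the paper's benchmarks either way; settling the claim would require the same matched protocol on the actual tasks, repeated over several seeds.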

Figures

Figures reproduced from arXiv: 2505.13196 by Lucas Schorling, Michael A. Osborne, Natalia Ares, Pranav Vaidhyanathan.

Figure 2: (a) Training loss curves for VRAdam, Adam, and SAM (Foret et al., 2021) for ResNet 32 on CIFAR-10; (b) training accuracy curves; (c) maximal eigenvalues of the loss Hessian; (d) effective learning rate during training. Hyperparameters for these plots are provided in Appendix D.4.
4.2 Stability of VRAdam. We analyze the behavior of VRAdam in the adaptive edge of stability regime compared to that of Adam in …
Figure 3: Train (left) and validation (right) loss curves with error envelopes calculated using different …
Figure 4: Train (left) and validation (right) loss curves with error envelopes calculated using different …
Figure 5: Train (left) and validation (right) loss curves calculated using different run values for …
Figure 6: Train (left) and validation (right) loss curves with error envelopes calculated using different …
Figure 7: Train (right) curves with error envelopes calculated using different run values for image …
Figure 8: Train (right) curves with error envelopes calculated using different run values for image …
Figure 9: Train (right) curves with error envelopes calculated using different run values for image …
Figure 10: Train (right) curves with error envelopes calculated using different run values for image …
Original abstract

We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on the stabilizing effect of quartic kinetic-energy terms on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate in the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of the loss. VRAdam instead adds a higher-order penalty on the learning rate based on the velocity, so that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations. By combining this velocity-based regularizer for global damping with the per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide a rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non-convex objective under mild assumptions. We demonstrate that VRAdam exceeds the performance of standard optimizers, including AdamW. We benchmark on various tasks such as image classification, language modeling, and generative modeling, using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer that augments Adam with a higher-order velocity penalty derived from quartic kinetic energy terms. This penalty automatically reduces the effective learning rate in high-velocity regimes to damp oscillations at the edge of stability. The manuscript claims a rigorous physical/control-theoretic analysis of momentum dynamics and derives a convergence rate of O(ln(N)/sqrt(N)) for stochastic non-convex objectives under mild assumptions. Empirically, VRAdam is reported to outperform AdamW and other standard optimizers on image classification, language modeling, and generative modeling tasks across CNNs, Transformers, and GFlowNets.

Significance. If the convergence analysis can be made fully rigorous with explicit assumptions and the empirical gains prove robust under matched hyperparameter budgets, the hybrid of global velocity damping and per-parameter adaptation could provide both practical improvements and new insights into optimizer stability for deep networks. The stated rate would be noteworthy for non-convex stochastic optimization if the assumptions align with typical deep-learning regimes.

major comments (3)
  1. [Theoretical analysis] Theoretical analysis section: the O(ln(N)/sqrt(N)) convergence bound is asserted under 'mild assumptions' for stochastic non-convex objectives, yet no explicit statement of those assumptions, key lemmas, or derivation steps appears; this is load-bearing for the central theoretical claim.
  2. [Experiments] Experiments section: superiority over AdamW is stated for multiple tasks and architectures, but no quantitative tables, ablation controls on the velocity penalty strength, or error bars are provided, preventing assessment of whether gains survive identical tuning budgets.
  3. [Method] Optimizer description: the interaction between the velocity penalty and Adam's per-parameter adaptive scaling is described at a high level, but no analysis or experiments address whether the penalty introduces new instabilities or requires task-specific retuning beyond the baseline.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'rigorous theoretical analysis' should include a forward reference to the specific section containing the proof or derivation.
  2. [Method] Notation: the precise mathematical form of the velocity penalty term (e.g., how it modifies the update rule) should be stated explicitly with an equation number for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive feedback, which has helped us identify areas for improvement in our manuscript. Below, we provide detailed responses to each major comment and indicate the revisions we have made or plan to make.

Point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis section: the O(ln(N)/sqrt(N)) convergence bound is asserted under 'mild assumptions' for stochastic non-convex objectives, yet no explicit statement of those assumptions, key lemmas, or derivation steps appears; this is load-bearing for the central theoretical claim.

    Authors: We thank the referee for highlighting this issue. Upon review, we agree that the assumptions and derivation steps should be stated more explicitly to support the central claim. In the revised version, we will add a new subsection in the theoretical analysis that lists all assumptions clearly, presents the key lemmas, and sketches the main steps of the proof for the O(ln(N)/sqrt(N)) rate. This will ensure the analysis is self-contained and rigorous.

    revision: yes

  2. Referee: [Experiments] Experiments section: superiority over AdamW is stated for multiple tasks and architectures, but no quantitative tables, ablation controls on the velocity penalty strength, or error bars are provided, preventing assessment of whether gains survive identical tuning budgets.

    Authors: The referee is right that the experimental results need more quantitative support to allow proper evaluation. We have now included detailed tables with performance metrics for each task and architecture, along with ablation studies on the velocity penalty coefficient. Additionally, we report means and standard deviations over multiple independent runs to provide error bars. These changes were made under the constraint of matched hyperparameter tuning budgets where possible.

    revision: yes

  3. Referee: [Method] Optimizer description: the interaction between the velocity penalty and Adam's per-parameter adaptive scaling is described at a high level, but no analysis or experiments address whether the penalty introduces new instabilities or requires task-specific retuning beyond the baseline.

    Authors: We appreciate this observation. To address it, we have expanded the optimizer description to analyze the interaction between the velocity regularizer and Adam's adaptive mechanism from both theoretical and practical standpoints. We also conducted experiments testing for instabilities and the need for retuning, showing that the penalty parameter can be set to a default value that works across the tested tasks without significant additional tuning.

    revision: yes
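
The error bars discussed in the second response amount to a per-step mean and standard deviation across independent runs. A minimal Python sketch of that aggregation is given below; the helper name `summarize_runs` is hypothetical, and the numbers in `toy_curves` are placeholders for illustration only, not results from the paper.

```python
import numpy as np

def summarize_runs(curves):
    """curves: shape (n_runs, n_steps) -> per-step mean and sample std across runs."""
    arr = np.asarray(curves, dtype=float)
    if arr.ndim != 2 or arr.shape[0] < 2:
        raise ValueError("need at least two runs of equal length")
    mean = arr.mean(axis=0)
    std = arr.std(axis=0, ddof=1)   # ddof=1: sample standard deviation
    return mean, std

# Placeholder loss curves from three hypothetical seeds.
toy_curves = [[1.0, 0.5, 0.25],
              [1.2, 0.7, 0.35],
              [0.8, 0.3, 0.15]]
mean, std = summarize_runs(toy_curves)
print("mean:", mean)
print("std: ", std)
```

Using `ddof=1` is a design choice that suits the small number of seeds typical of such comparisons; with few runs the population formula (`ddof=0`) would understate the spread.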

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper introduces VRAdam by adding a velocity-based higher-order penalty to Adam, motivated by a physics analogy to quartic kinetic energy terms, and separately states a convergence bound of O(ln(N)/sqrt(N)) for stochastic non-convex objectives under assumptions it labels as mild. No equation or step in the provided text reduces a claimed prediction or bound back to a fitted parameter, a self-citation, or an ansatz by construction. The empirical benchmarks on image classification, language modeling, and generative tasks are presented as independent validation rather than as forced outputs of the same inputs used in the analysis. The mild assumptions are not shown to be retrofitted to the target rate, and the physics framing supplies interpretive context without making the mathematical derivation tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on the existence of an adaptive edge-of-stability regime in standard Adam training and on mild assumptions sufficient for the stated convergence rate; no new particles or dimensions are introduced.

free parameters (1)
  • velocity penalty strength
    The coefficient controlling the higher-order velocity term is a tunable hyperparameter whose value is not derived from first principles.
axioms (2)
  • domain assumption: Previous Adam-like optimizers operate at the adaptive edge of stability, producing rapid oscillations.
    Invoked to motivate the need for the velocity regularizer.
  • domain assumption: Mild assumptions hold for the stochastic non-convex objective.
    Required to obtain the O(ln(N)/sqrt(N)) convergence bound.
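
Because the penalty strength is the one free parameter in this ledger, it helps to see what it controls. The Python sketch below scans velocity norms for a few values of `beta3` and compares the largest gated step length with the cap `alpha0 / (2 * sqrt(beta3))` quoted in the appendix excerpt at the end of this page (Eq. 46). It uses the uncapped, momentum-style gate of that equation; the function name `gated_step_norm` and the specific values of `alpha0` and `beta3` are assumptions chosen for illustration.

```python
import numpy as np

def gated_step_norm(v_norm, alpha0, beta3):
    """Step length alpha0*|v| / (1 + beta3*|v|^2) for the uncapped gate."""
    return alpha0 * v_norm / (1.0 + beta3 * v_norm ** 2)

alpha0 = 1e-3                       # hypothetical base step size
v_grid = np.logspace(-3, 3, 2001)   # velocity norms to scan
for beta3 in (0.1, 1.0, 10.0):      # hypothetical penalty strengths
    observed = gated_step_norm(v_grid, alpha0, beta3).max()
    cap = alpha0 / (2.0 * np.sqrt(beta3))
    print(f"beta3={beta3:5.1f}  max step on grid={observed:.3e}  cap={cap:.3e}")
```

Under this form, a larger `beta3` lowers the cap on any single step, which is one concrete way the hyperparameter trades speed for damping.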

pith-pipeline@v0.9.0 · 5768 in / 1475 out tokens · 64011 ms · 2026-05-22T14:30:54.247066+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 10 internal anchors


  26. [26]

    A BOUNDING NRQCD. By explicitly breaking certain symmetries (Lorentz invariance in NRQCD and time-translation in time crystals), higher-order kinetic terms paradoxically enhance stability through topological protection mechanisms and the generation of emergent length/time scales (Niemi, 2021; Guha & Ghose-Choudhury, 2019). As a demonstration of this phenomeno...

  27. [27]

    $\dfrac{(1-\beta_1)\,\alpha_0}{1+\min(\beta_3 |v_t|^2,\ \alpha_1)}$, (45) so the method moves away from instability as oscillations grow. Second, each parameter update is uniformly bounded in norm by the gate, $|x_t - x_{t-1}| = \eta_t |v_t| = \dfrac{\alpha_0 |v_t|}{1+\beta_3 |v_t|^2} \le \dfrac{\alpha_0}{2\sqrt{\beta_3}}$, (46) which prevents runaway steps and is not available to classical momentum. These properties are consistent with the design of Algo...
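
    The inequality in Eq. (46) can be checked in two lines. The derivation below is a sketch supplied here, not taken from the paper, and it assumes the uncapped gate $\eta_t = \alpha_0 / (1 + \beta_3 |v_t|^2)$ that Eq. (46) itself uses, with $\beta_3 > 0$ and $|v_t| > 0$.

```latex
\bigl(1 - \sqrt{\beta_3}\,|v_t|\bigr)^2 \ge 0
\;\Longrightarrow\;
1 + \beta_3 |v_t|^2 \ge 2\sqrt{\beta_3}\,|v_t|
\;\Longrightarrow\;
\frac{\alpha_0 |v_t|}{1 + \beta_3 |v_t|^2}
\le \frac{\alpha_0 |v_t|}{2\sqrt{\beta_3}\,|v_t|}
= \frac{\alpha_0}{2\sqrt{\beta_3}} .
```

    Equality holds at $|v_t| = 1/\sqrt{\beta_3}$, so the cap is attained at one velocity scale and the step length falls off on either side of it.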