A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks

Taiki Miyagawa

arxiv: 2606.18303 · v1 · pith:LGZVJ2Q5new · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 02:16 UTC · model grok-4.3

keywords: shock-wave theory · symmetry reduction · stochastic gradient descent · Hamilton-Jacobi equation · Burgers equation · neural networks · quotient manifold · local entropy

The pith

Symmetry-quotiented SGD dynamics in neural networks reduce to a viscous Hamilton-Jacobi equation whose gradient obeys a Burgers-type equation with possible shock formation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an explicit link between shock-wave theory and the symmetry-reduced learning dynamics of stochastic gradient descent in artificial neural networks. After quotienting out parameter symmetries via Lie group methods and applying local-entropy coarse-graining, the effective dynamics on the quotient manifold satisfy a viscous Hamilton-Jacobi equation. Under the assumption that raw parameter dynamics can be summarized by a gradient field on this quotient space, the gradient of the coarse-grained loss obeys a Burgers-type equation, for which shock formation is proved rigorously. The framework is applied to multilayer perceptrons, convolutional networks, Transformers, and mean-field networks, which are shown to obey the Hamilton-Jacobi or Burgers-type equations. It is further conjectured to supply symmetry-corrected observables for monitoring training-phase transitions where raw parameter norms are distorted by redundancy.

Core claim

After quotienting parameter symmetries and applying local-entropy coarse-graining, the effective dynamics satisfy a viscous Hamilton--Jacobi equation on the quotient manifold. Under the assumption that the raw parameter dynamics can be summarized by a gradient field on the quotiented space, the gradient of the coarse-grained loss function obeys a Burgers-type equation, and shock formation can be established rigorously. The same equations hold for multilayer perceptrons, convolutional neural networks, Transformers, and mean-field networks.
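As a sketch of the step from the Hamilton--Jacobi equation to the Burgers-type equation, consider the flat Euclidean case. This is an assumption made here for illustration: the paper works on the quotient manifold, where the gradient and Laplacian would be replaced by their Riemannian counterparts and additional geometric terms may appear. The symbols u (coarse-grained loss), v (its gradient), and ν (viscosity) are notation chosen for this sketch, not taken from the paper.

```latex
\begin{align}
  % viscous Hamilton--Jacobi equation (flat case, assumed form)
  \partial_t u + \tfrac{1}{2}\lvert\nabla u\rvert^{2} &= \nu\,\Delta u, \\
  % apply \nabla and set v := \nabla u; because v is a gradient,
  % \partial_i v_j = \partial_j v_i, hence \nabla(\tfrac12 |v|^2) = (v\cdot\nabla)v
  \partial_t v + (v\cdot\nabla)\,v &= \nu\,\Delta v .
\end{align}
```

The second line uses the symmetry of the Jacobian of v, which is exactly where a gradient-field hypothesis enters in this simplified setting.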

What carries the argument

Symmetry quotienting of parameter space via Lie groups, followed by local-entropy coarse-graining, which produces a viscous Hamilton-Jacobi equation on the quotient manifold whose gradient satisfies a Burgers-type equation.
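For orientation, the classical flat-space form of local-entropy coarse-graining can be written through the Hopf--Cole transform, as in the Chaudhari et al. entry of the reference list. The formula below is that standard Euclidean version, stated under the assumption that the paper's quotient-manifold construction reduces to it in local coordinates; f (the raw loss), ν, and G are notation chosen for this sketch.

```latex
\begin{align}
  % local-entropy smoothing of a loss f at scale t, with viscosity \nu > 0
  u(x,t) &= -2\nu\,\log\!\bigl( G_{2\nu t} * e^{-f/(2\nu)} \bigr)(x), \qquad u(x,0) = f(x), \\
  % Gaussian kernel with variance s per coordinate in dimension d
  G_{s}(z) &= (2\pi s)^{-d/2}\exp\!\bigl(-\lvert z\rvert^{2}/(2s)\bigr), \\
  % \varphi := G_{2\nu t} * e^{-f/(2\nu)} solves the heat equation
  \partial_t \varphi &= \nu\,\Delta\varphi
  \quad\Longrightarrow\quad
  \partial_t u + \tfrac{1}{2}\lvert\nabla u\rvert^{2} = \nu\,\Delta u .
\end{align}
```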

If this is right

Multilayer perceptrons, convolutional networks, Transformers, and mean-field networks all obey the derived Hamilton-Jacobi or Burgers-type equations after symmetry reduction.
Shock formation can be established rigorously once the gradient-field assumption holds on the quotient manifold.
Symmetry-corrected quotient observables supply a principled basis for monitoring, forecasting, and controlling training-phase transitions.
Raw parameter norms in architectures such as Transformers are often distorted by symmetry redundancy and may therefore be misleading.
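The contrast between raw norms and symmetry-corrected observables can be illustrated on a toy case. The sketch below uses a two-layer ReLU network and its positive per-unit rescaling symmetry; the network, the function names, and the particular invariant (a sum of per-unit norm products) are hypothetical choices for illustration and are not the observables defined in the paper.

```python
import numpy as np

def forward(W1, W2, x):
    """Two-layer ReLU network without biases: x -> W2 relu(W1 x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def raw_norm(W1, W2):
    """Squared Euclidean norm of all parameters (not symmetry-invariant)."""
    return float(np.sum(W1**2) + np.sum(W2**2))

def quotient_observable(W1, W2):
    """Illustrative invariant: sum over hidden units of
    ||incoming weights|| * ||outgoing weights||."""
    return float(np.sum(np.linalg.norm(W1, axis=1) * np.linalg.norm(W2, axis=0)))

def rescale(W1, W2, a):
    """Positive rescaling symmetry: scale unit i's incoming weights by a[i]
    and its outgoing weights by 1/a[i]; the network function is unchanged."""
    return W1 * a[:, None], W2 / a[None, :]

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))      # hidden x input
W2 = rng.normal(size=(3, 8))      # output x hidden
x = rng.normal(size=4)
a = rng.uniform(0.1, 10.0, size=8)

W1s, W2s = rescale(W1, W2, a)
print("same function:      ", np.allclose(forward(W1, W2, x), forward(W1s, W2s, x)))
print("raw norm before/after:", raw_norm(W1, W2), raw_norm(W1s, W2s))
print("invariant before/after:", quotient_observable(W1, W2), quotient_observable(W1s, W2s))
```

Both parameter settings represent the same point of the quotient space, so any quantity that differs between them, such as the raw norm, reflects the choice of representative rather than the learned function.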

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Burgers-shock picture may suggest new regularization strategies that explicitly damp or steer shock locations during training.
The same quotient construction could be applied to other first-order optimizers to test whether their reduced dynamics also admit fluid-mechanical descriptions.
Symmetry-corrected observables might serve as early-warning signals for generalization transitions that are invisible in the original parameter space.

Load-bearing premise

The raw parameter dynamics can be summarized by a gradient field on the quotiented space.
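One local, necessary condition for a vector field to be a gradient field is that its Jacobian is symmetric. The sketch below tests that condition by finite differences on two toy fields. It assumes the reduced drift is available as a function in local coordinates on the quotient, which the paper does not supply here; `jacobian_fd`, `antisymmetry_score`, and both example fields are hypothetical names introduced for illustration.

```python
import numpy as np

def jacobian_fd(F, x, eps=1e-5):
    """Central-difference Jacobian J[i, j] = dF_i/dx_j at the point x."""
    x = np.asarray(x, dtype=float)
    d = x.size
    J = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (F(x + e) - F(x - e)) / (2.0 * eps)
    return J

def antisymmetry_score(F, x, eps=1e-5):
    """Relative size of the antisymmetric part of the Jacobian.
    Near 0 is consistent with a gradient field; larger values are not."""
    J = jacobian_fd(F, x, eps)
    A = 0.5 * (J - J.T)
    return float(np.linalg.norm(A) / (np.linalg.norm(J) + 1e-12))

def grad_field(x):
    """Gradient of V(x) = 0.5*|x|^2 + sin(x0)*x1."""
    return np.array([x[0] + np.cos(x[0]) * x[1], x[1] + np.sin(x[0])])

def rotational_field(x):
    """Pure rotation: not the gradient of any potential."""
    return np.array([-x[1], x[0]])

x0 = np.array([0.3, -0.7])
score_grad = antisymmetry_score(grad_field, x0)
score_rot = antisymmetry_score(rotational_field, x0)
print("gradient field score:  ", score_grad)
print("rotational field score:", score_rot)
```

A small score at sampled points does not prove the premise, since the condition is only local and says nothing about mini-batch noise, but a large score would be direct evidence against it.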

What would settle it

A direct numerical check showing that the gradient of the coarse-grained loss in a symmetry-quotiented multilayer perceptron fails to satisfy the Burgers-type equation would falsify the claimed reduction.
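The shape of such a check can be sketched in the simplest setting: one flat, periodic dimension, where the coarse-grained loss is given in closed form by the Hopf-Cole transform. The code below measures the residual of the viscous Burgers equation for the gradient of that smoothed loss. It does not test the paper's quotient construction; the toy loss cos(x), the viscosity, the grid, and all function names are assumptions made for illustration.

```python
import numpy as np

L = 2.0 * np.pi  # period of the toy domain

def wavenumbers(n):
    """Angular wavenumbers for an n-point periodic grid on [0, L)."""
    return 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)

def hopf_cole_velocity(f0, nu, t):
    """v = u_x for u = -2*nu*log(phi), where phi solves phi_t = nu*phi_xx
    with phi(., 0) = exp(-f0 / (2*nu))."""
    k = wavenumbers(f0.size)
    phi_hat = np.fft.fft(np.exp(-f0 / (2.0 * nu))) * np.exp(-nu * k**2 * t)
    phi = np.fft.ifft(phi_hat).real
    phi_x = np.fft.ifft(1j * k * phi_hat).real
    return -2.0 * nu * phi_x / phi

def spectral_derivative(v, order):
    """Spatial derivative of a periodic grid function via FFT."""
    k = wavenumbers(v.size)
    return np.fft.ifft((1j * k) ** order * np.fft.fft(v)).real

def burgers_residual(f0, nu, t, dt=1e-4):
    """Pointwise residual of v_t + v*v_x - nu*v_xx, with v_t taken by a
    central difference in time."""
    v = hopf_cole_velocity(f0, nu, t)
    v_t = (hopf_cole_velocity(f0, nu, t + dt)
           - hopf_cole_velocity(f0, nu, t - dt)) / (2.0 * dt)
    return v_t + v * spectral_derivative(v, 1) - nu * spectral_derivative(v, 2)

n = 256
x = np.linspace(0.0, L, n, endpoint=False)
f0 = np.cos(x)  # toy raw loss
res = burgers_residual(f0, nu=0.5, t=0.3)
max_residual = float(np.max(np.abs(res)))
print("max |residual|:", max_residual)
```

In a real test, the closed-form `hopf_cole_velocity` would be replaced by an empirical estimate of the coarse-grained loss gradient in quotient coordinates, and a residual that stays large under grid and time-step refinement would count against the claimed reduction.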

Figures

Figures reproduced from arXiv: 2606.18303 by Taiki Miyagawa.

**Figure 1.** Notation. Let Θ ⊂ ℝ^{d_Θ} be a smooth parameter manifold, and let a Lie group or finite group G act smoothly on Θ. We assume that there is an open regular stratum Θ_reg ⊂ Θ on which the action is free and proper. Then, the quotient M := Θ_reg/G forms a smooth manifold, and the quotient map π : Θ_reg → M is a smooth submersion. In the finite-group case, the global quotient may be an orbifold, but on each princi…

**Figure 2.** Hopf–Cole shock profile in a quotient ReLU model.

**Figure 3.** Shock formation in viscous Burgers equation via Hopf–Cole transform.

Original abstract

We develop a mathematically explicit link between shock-wave theory and the symmetry-quotiented learning dynamics of stochastic gradient descent, drawing on differential geometry, Lie group theory, and fluid mechanics. Specifically, after quotienting parameter symmetries and applying local-entropy coarse-graining, the effective dynamics satisfy a viscous Hamilton--Jacobi equation on the quotient manifold. Moreover, under the assumption that the raw parameter dynamics can be summarized by a gradient field on the quotiented space, the gradient of the coarse-grained loss function obeys a Burgers-type equation, and shock formation can be established rigorously. We apply our theory to multilayer perceptrons, convolutional neural networks, Transformers, and mean-field networks, and show that they obey the Hamilton--Jacobi or Burgers-type equations. We conjecture that this framework also yields practical diagnostics for deep learning. In architectures such as Transformers, raw parameter norms are often distorted by symmetry redundancy and may therefore be misleading, whereas symmetry-corrected quotient observables provide a principled basis for monitoring, forecasting, and controlling training-phase transitions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper links shock-wave theory to symmetry-reduced SGD, but the key step assumes that the parameter dynamics form a gradient field on the quotient manifold, and this is not verified for the claimed architectures.

The paper's central result is a mathematical connection between shock-wave theory and the dynamics of SGD after symmetry reduction. But this connection is conditional on the assumption that the parameter updates can be summarized by a gradient field on the quotient manifold. That assumption is not verified in the work for the architectures it claims to cover.

On the positive side, the author uses Lie group theory to handle parameter symmetries by quotienting, then introduces local-entropy coarse-graining to obtain a viscous Hamilton-Jacobi equation on the resulting manifold. Under the gradient-field assumption, the gradient of the coarse-grained loss is shown to follow a Burgers-type equation, for which shock formation can be shown rigorously. This explicit use of fluid-mechanics tools in the context of neural network optimization appears to be a fresh contribution.

The framework is internally consistent and applies the same structure to several common architectures, including MLPs, CNNs, Transformers, and mean-field networks. The idea that symmetry-corrected observables might be more reliable for monitoring training in redundant parameter spaces is worth considering.

The main limitation is the load-bearing assumption. SGD is stochastic, so trajectories include noise that may not fit neatly into a gradient field. If residual symmetry effects or batch-induced variations prevent the summarization, then the Burgers equation and the shock results do not follow. The paper states that these networks obey the equations but does not provide the checks or derivations that would confirm the assumption holds in each case.

This kind of work is aimed at researchers who think about optimization through the lens of differential geometry and PDEs. It could interest people looking for new ways to analyze training dynamics beyond standard loss curves.

I think it deserves peer review. The novelty of the link is enough to warrant referees looking at whether the assumption can be supported or if the theory can be made more robust.

Referee Report

1 major / 1 minor

Summary. The manuscript claims to establish a rigorous connection between shock-wave theory and the symmetry-quotiented dynamics of stochastic gradient descent in artificial neural networks. By quotienting out parameter symmetries and applying local-entropy coarse-graining, the effective dynamics are shown to satisfy a viscous Hamilton--Jacobi equation on the quotient manifold. Under the additional assumption that the raw parameter dynamics can be summarized by a gradient field on this quotient space, the gradient of the coarse-grained loss function is shown to obey a Burgers-type equation, which allows shock formation to be established rigorously. The framework is then applied to multilayer perceptrons, convolutional neural networks, Transformers, and mean-field networks, with the assertion that they obey these equations, and with conjectures on practical diagnostics for training.

Significance. If the central assumption regarding the gradient-field summarization holds and the derivations are correct, the work offers a potentially significant interdisciplinary link between differential geometry, fluid mechanics, and machine learning theory. It could provide new analytical tools for understanding symmetry effects and phase transitions in deep learning training, particularly for architectures like Transformers where symmetry redundancy affects parameter norms. The explicit mathematical framework and conjectured diagnostics represent a strength if substantiated.

major comments (1)

[Applications to architectures (abstract and corresponding section)] The assertion that MLPs, CNNs, Transformers, and mean-field networks obey the Hamilton--Jacobi or Burgers-type equations (as stated in the abstract and the applications section) rests on the unexamined assumption that their raw parameter dynamics can be summarized by a gradient field on the quotiented space. No explicit check or justification is provided that SGD trajectories for these architectures satisfy this condition (e.g., absence of non-gradient components from mini-batch noise or residual symmetries). This assumption is load-bearing for the Burgers equation and shock-formation results, so the applicability claims require verification or qualification.

minor comments (1)

[Abstract] The abstract introduces 'local-entropy coarse-graining' without a brief inline definition or forward reference to its definition in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the single major comment below.

Referee: [Applications to architectures (abstract and corresponding section)] The assertion that MLPs, CNNs, Transformers, and mean-field networks obey the Hamilton--Jacobi or Burgers-type equations (as stated in the abstract and the applications section) rests on the unexamined assumption that their raw parameter dynamics can be summarized by a gradient field on the quotiented space. No explicit check or justification is provided that SGD trajectories for these architectures satisfy this condition (e.g., absence of non-gradient components from mini-batch noise or residual symmetries). This assumption is load-bearing for the Burgers equation and shock-formation results, so the applicability claims require verification or qualification.

Authors: We agree that the gradient-field assumption is essential for the Burgers-type equation and the rigorous shock-formation result. The viscous Hamilton--Jacobi equation on the quotient manifold follows directly from symmetry quotienting and local-entropy coarse-graining without this assumption. The manuscript applies the general framework to MLPs, CNNs, Transformers, and mean-field networks to obtain the Hamilton--Jacobi dynamics; the Burgers equation is invoked only under the additional gradient-field hypothesis. We will revise the abstract and applications section to make this distinction explicit, qualify the Burgers and shock claims as conditional on the assumption, and note that empirical verification of the assumption for these architectures remains an open question for future work. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation conditional on explicit assumption with no self-referential reduction

The provided abstract and context show a derivation that first obtains a viscous Hamilton-Jacobi equation via symmetry quotienting and local-entropy coarse-graining. It then states an explicit assumption ('under the assumption that the raw parameter dynamics can be summarized by a gradient field on the quotiented space') before deriving the Burgers-type equation and rigorous shock formation. Application to MLPs, CNNs, Transformers and mean-field networks is presented as obeying the equations, but this is framed under the same stated assumption rather than by redefining inputs to match outputs. No self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain appears. The assumption is load-bearing for the strongest claim yet is openly declared and not constructed from the target result itself. This satisfies the default expectation of non-circularity for a mathematically conditional derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on quotienting symmetries, local-entropy coarse-graining, and the explicit assumption that raw parameter dynamics reduce to a gradient field on the quotient manifold; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: The raw parameter dynamics can be summarized by a gradient field on the quotiented space. This is the stated assumption required for the Burgers-type equation and shock formation to hold.

Reference graph

Works this paper leans on

11 extracted references · 2 linked inside Pith

[1] Neyshabur, Behnam, Russ R. Salakhutdinov, and Nati Srebro. "Path-SGD: Path-normalized optimization in deep neural networks." Advances in Neural Information Processing Systems 28 (2015).

[2] Rangamani, Akshay, et al. "A scale invariant flatness measure for deep network minima." arXiv preprint arXiv:1902.02434 (2019). (linked inside Pith)

[3] Chaudhari, Pratik, et al. "Deep relaxation: partial differential equations for optimizing deep neural networks." Research in the Mathematical Sciences 5.3 (2018): 30.

[4] Li, Qianxiao, and Cheng Tai. "Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations." Journal of Machine Learning Research 20.40 (2019): 1-47.

[5] Gess, Benjamin, Sebastian Kassing, and Vitalii Konarovskyi. "Stochastic modified flows, mean-field limits and dynamics of stochastic gradient descent." Journal of Machine Learning Research 25.30 (2024): 1-27.

[6] Arora, Sanjeev, Zhiyuan Li, and Kaifeng Lyu. "Theoretical analysis of auto rate-tuning by batch normalization." arXiv preprint arXiv:1812.03981 (2018). (linked inside Pith)

[7] Mei, Song, Andrea Montanari, and Phan-Minh Nguyen. "A mean field view of the landscape of two-layer neural networks." Proceedings of the National Academy of Sciences 115.33 (2018): E7665-E7671.

[8] Sirignano, Justin, and Konstantinos Spiliopoulos. "Mean field analysis of deep neural networks." Mathematics of Operations Research 47.1 (2022): 120-152.

[9] Da Silva, Marvin F., Felix Dangel, and Sageev Oore. "Hide & seek: Transformer symmetries obscure sharpness & Riemannian geometry finds it." arXiv preprint arXiv:2505.05409 (2025).

[10] Evans, Lawrence C. Partial differential equations. Vol. 19. American Mathematical Society, 2022.

[11] LeVeque, Randall J. Numerical methods for conservation laws. Vol. 132. Basel: Birkhäuser, 1992.
