StableGrad: Backward Scale Control without Batch Normalization

Alberto Fernández-Hernández; Cristian Pérez-Corral; Enrique S. Quintana-Ortí; Jose I. Mestre; Manuel F. Dolz

arxiv: 2605.19856 · v1 · pith:Z3BU4E5Nnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 06:45 UTC · model grok-4.3

keywords: StableGrad, gradient scale control, batch normalization alternative, physics-informed neural networks, deep network training, optimizer-level normalization, PINNs

The pith

StableGrad rescales weight gradients after backpropagation to fix scale imbalances in deep networks without touching the forward model or using batch normalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes StableGrad as a way to keep activations and gradients from vanishing or exploding in very deep networks by correcting imbalances between layers at the optimizer stage. This matters because many applications, especially Physics-Informed Neural Networks, cannot use batch normalization or similar forward-pass fixes since those would distort the physical field and its derivatives that define the loss. Instead, StableGrad applies a corrective scaling only to the gradients after they have been computed, so the network's outputs and the training objective stay exactly the same. Experiments show the method improves accuracy on deep PINN benchmarks and prevents collapse when BatchNorm is removed from convolutional networks like ResNet and EfficientNet.

Core claim

StableGrad is an optimizer-level scale-control mechanism that corrects layer-wise weight-gradient imbalances without modifying the forward model. Because the normalization is applied only after backpropagation and before the optimizer update, the network output, its derivatives, and the physical residual remain unchanged. Analysis of the resulting training dynamics shows that this post-backprop rescaling stabilizes optimization in deep PINNs and in BatchNorm-free convolutional architectures under standard training settings.
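The ordering claim can be made concrete with a minimal sketch. Everything here is an assumption for illustration and not taken from the paper: the two-parameter network `u`, the toy ODE residual u'(x) = u(x), the finite-difference `grads` function standing in for backpropagation, and the arbitrary per-layer `factors`. It only shows that quantities computed in the forward pass are the same before and after the gradients are rescaled, and that the rescaling affects nothing but the update.

```python
import math

def u(params, x):
    """Tiny stand-in network: u(x) = w2 * tanh(w1 * x)."""
    return params["w2"] * math.tanh(params["w1"] * x)

def du_dx(params, x):
    """Input derivative of u, the kind of quantity a PINN residual is built from."""
    t = math.tanh(params["w1"] * x)
    return params["w2"] * params["w1"] * (1.0 - t * t)

def residual_loss(params, x):
    """Squared residual of the toy ODE u'(x) = u(x) at one collocation point."""
    r = du_dx(params, x) - u(params, x)
    return r * r

def grads(params, x, h=1e-6):
    """Central finite differences, standing in for backpropagation."""
    out = {}
    for name in params:
        up = dict(params)
        up[name] += h
        down = dict(params)
        down[name] -= h
        out[name] = (residual_loss(up, x) - residual_loss(down, x)) / (2 * h)
    return out

params = {"w1": 0.7, "w2": 1.3}
x = 0.4

# 1) Forward pass: output, input derivative, and residual loss.
before = (u(params, x), du_dx(params, x), residual_loss(params, x))
# 2) Backward pass.
g = grads(params, x)
# 3) Rescale the gradients only (hypothetical per-layer factors).
factors = {"w1": 5.0, "w2": 0.2}
g_scaled = {k: factors[k] * g[k] for k in g}
# Forward quantities re-evaluated after the rescale, before any update.
after = (u(params, x), du_dx(params, x), residual_loss(params, x))
# 4) Optimizer step (plain gradient descent here) using the rescaled gradients.
lr = 0.01
new_params = {k: params[k] - lr * g_scaled[k] for k in params}
```

The invariance holds only at the current step: once the update is applied, the parameters, and therefore later outputs, differ from those of an unscaled run.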

What carries the argument

StableGrad: post-backpropagation rescaling of weight gradients to enforce balanced layer-wise magnitudes before the optimizer step.
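A rough sketch of what such a per-layer rescaling could look like, modeled on the rule quoted in the Lean-theorems section of this page (a factor of the form σout / (σℓ + ε) applied to each layer's gradient). The definitions of σℓ and σout are assumptions made here: σℓ is taken as the root-mean-square of layer ℓ's gradient entries and σout as the same statistic for the last layer. The paper may define both differently, and the function name `rescale_layer_grads` is hypothetical.

```python
import math

def rms(values):
    """Root-mean-square of a flat list of gradient entries."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def rescale_layer_grads(layer_grads, eps=1e-12):
    """Scale each layer's gradient to the RMS of the last layer's gradient.

    layer_grads: list of flat lists, one per layer, ordered input -> output.
    Assumption: sigma_l is the RMS of layer l's gradient and sigma_out is the
    RMS of the final layer's gradient; the paper may define both differently.
    """
    sigma_out = rms(layer_grads[-1])
    scaled = []
    for g in layer_grads:
        factor = sigma_out / (rms(g) + eps)
        scaled.append([factor * v for v in g])
    return scaled

# Made-up gradients that shrink by 100x per layer toward the input,
# mimicking a vanishing-gradient regime.
grads = [[3e-6, -4e-6], [3e-4, -4e-4], [3e-2, -4e-2]]
balanced = rescale_layer_grads(grads)
layer_rms = [rms(g) for g in balanced]
```

After the call, every entry of `layer_rms` is close to the last layer's RMS, and each layer's gradient keeps its sign pattern because the factor is a positive scalar.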

If this is right

Deeper PINNs achieve higher solution accuracy on standard benchmarks while keeping the physical residual exactly as defined.
ResNet and EfficientNet models train to completion without Batch Normalization and without any other architectural modifications.
Network outputs and input derivatives stay identical to the unscaled case, preserving the meaning of the loss for physics-informed objectives.
The approach works as a drop-in step at the optimizer level of a gradient-based training loop, since it needs only the gradients that backpropagation has already computed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same post-backprop correction could be combined with other optimizer choices such as Adam or SGD variants to further tune effective learning rates per layer.
Because no change occurs to the forward computation graph, StableGrad might extend to settings where the model must remain a pure function of its inputs, such as certain scientific computing or differentiable physics pipelines.
If the rescaling rule can be made adaptive to the current gradient statistics, it might reduce sensitivity to the choice of initial learning rate in very deep stacks.
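The link between gradient rescaling and per-layer learning rates can be written out in one step. This is a worked equation under assumptions not taken from the paper: plain SGD with step size η, and a generic positive factor c_ℓ applied to layer ℓ's gradient g_ℓ at step t.

```latex
\begin{aligned}
\tilde{g}_\ell^{(t)} &= c_\ell^{(t)}\, g_\ell^{(t)}, \qquad c_\ell^{(t)} > 0, \\
\theta_\ell^{(t+1)} &= \theta_\ell^{(t)} - \eta\, \tilde{g}_\ell^{(t)}
  = \theta_\ell^{(t)} - \underbrace{\eta\, c_\ell^{(t)}}_{\eta_\ell^{(t)}}\, g_\ell^{(t)} .
\end{aligned}
```

Under plain SGD, then, rescaling layer ℓ's gradient is the same as giving that layer its own step size η c_ℓ. Adam-style optimizers behave differently: a factor that stays constant over time is absorbed by the second-moment normalization (up to Adam's own ε), so only the time variation of c_ℓ would change the update.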

Load-bearing premise

That applying scale correction only to the gradients after they are computed will reliably improve convergence without creating new instabilities or changing the optimization path in harmful ways.

What would settle it

Training the same deep PINN or BatchNorm-removed ResNet with and without StableGrad and finding that the version with StableGrad shows equal or worse accuracy and higher failure rates on the benchmark tasks.

Figures

Figures reproduced from arXiv: 2605.19856 by Alberto Fernández-Hernández, Cristian Pérez-Corral, Enrique S. Quintana-Ortí, Jose I. Mestre, Manuel F. Dolz.

**Figure 2.** Training dynamics of deep CNNs with and without BatchNorm. The default architectures …

**Figure 3.** Train and validation losses for AdamW and AdamW+StableGrad on the controlled Burgers …

**Figure 4.** Train and validation residual losses for AdamW, AdamW with the spectral learning-rate …

**Figure 5.** Magnified view of the activation-scale instability observed at the beginning of training for …

Original abstract

Training very deep neural networks requires controlling the propagation of magnitudes across depth. Without such control, activations and gradients may vanish, explode, or enter unstable regimes that make optimization fail. Modern architectures often mitigate this problem through Batch Normalization, residual connections, or other normalization layers, which repeatedly re-scale or bypass intermediate representations. However, these mechanisms are not always appropriate. In Physics-Informed Neural Networks (PINNs), the network represents a continuous physical field and its input derivatives define the training objective, making batch-dependent normalization problematic because it can introduce non-local dependencies into the predicted field and its derivatives. We propose StableGrad, an optimizer-level scale-control mechanism that corrects layer-wise weight-gradient imbalances without modifying the forward model. Because the normalization is applied only after backpropagation and before the optimizer update, the network output, its derivatives, and the physical residual remain unchanged. We analyze the effective training dynamics induced by this rescaling and evaluate StableGrad on deep PINNs as the target application, with BatchNorm-free convolutional networks serving as a diagnostic stress test. On PINN benchmarks, StableGrad improves matched-depth solution accuracy and makes deeper models more reliable under standard optimization. On ResNet and EfficientNet architectures, where removing Batch Normalization normally leads to training collapse, StableGrad stabilizes optimization without introducing any other architectural change. These results show that optimizer-level control of weight-gradient scale can provide a practical alternative when forward normalization is unavailable or undesirable.
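The abstract's point about batch-dependent normalization introducing non-local dependencies can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the paper's setup: a one-unit "field" u(x) = tanh(BN(w·x)), with the batch statistics computed over the collocation points that share a batch. The function names are hypothetical.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Normalize a batch of scalar pre-activations to zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def field_with_batchnorm(batch_x, w=2.0):
    """Toy field u(x) = tanh(BN(w * x)), evaluated on a whole batch."""
    return [math.tanh(z) for z in batch_normalize([w * x for x in batch_x])]

def field_plain(batch_x, w=2.0):
    """Same toy field without normalization: each output depends only on its own x."""
    return [math.tanh(w * x) for x in batch_x]

x0 = 0.5
batch_a = [x0, 0.0, 1.0]   # x0 evaluated alongside two other points
batch_b = [x0, 2.0, 3.0]   # same x0, different companions

bn_a = field_with_batchnorm(batch_a)[0]
bn_b = field_with_batchnorm(batch_b)[0]
plain_a = field_plain(batch_a)[0]
plain_b = field_plain(batch_b)[0]
```

With normalization, the predicted value at the same point `x0` changes with the other points in the batch (`bn_a` and `bn_b` differ), whereas the unnormalized field gives the same value in both cases. A field whose value at x depends on which other points were sampled has no well-defined pointwise derivative, and a PINN residual needs one.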

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

StableGrad is a post-backprop per-layer gradient rescaling that stabilizes deep PINNs and BatchNorm-free conv nets while leaving the forward map and residuals exactly as they were.

The punchline is that this is a targeted optimizer tweak: gradient scale control applied after backpropagation, which leaves the network output, derivatives, and PINN residuals unchanged. That is useful precisely because it avoids the problems that forward normalization causes in physics-informed settings.

What the paper does well is identify a practical need in deep PINNs, where BatchNorm isn't viable, then propose and test this rescaling approach. It reports better solution accuracy on matched-depth models and more reliable training for deeper ones. The conv-net stress tests on ResNet and EfficientNet without BatchNorm show stabilization, which adds some credibility as a general technique. The invariance claim holds up because the adjustment happens strictly after the backward pass. The authors also report analyzing the effective training dynamics, which is a good step.

Soft spots include the level of detail in the dynamics analysis, which might be more empirical than theoretical. The benchmarks could also benefit from more ablation on the rescaling factors and from comparisons to gradient clipping and related gradient-scaling methods. Nothing suggests the results are unreliable, but full verification would require the complete experimental logs.

This work is for the PINN and scientific ML crowd, plus anyone interested in normalization-free deep training. A reader focused on training stability in constrained architectures will find it relevant. It deserves peer review: the idea is straightforward, the claims are falsifiable, and it addresses a real pain point without overclaiming.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces StableGrad, an optimizer-level mechanism that rescales layer-wise weight gradients after backpropagation and before the parameter update. This corrects magnitude imbalances across depth while leaving the forward network, its derivatives, and any physical residual (in PINNs) unchanged. The authors analyze the induced effective dynamics and report empirical gains in solution accuracy for deep PINNs together with stabilization of BatchNorm-free ResNet and EfficientNet models under standard optimization.

Significance. If validated, the approach supplies a practical route to stable deep training when forward normalization layers are unavailable or undesirable, especially in physics-informed settings where batch-dependent operations would compromise derivative consistency. The strict post-backprop placement guarantees invariance of the model output and residual by construction, which is a clear technical strength. The stress-test results on convolutional architectures without BatchNorm further indicate that optimizer-level scale control can serve as a lightweight alternative to architectural modifications.

major comments (2)

[§4] (Analysis of effective dynamics): the derivation of the modified update rule assumes that per-layer gradient norms are the dominant source of scale imbalance; a concrete counter-example or sensitivity study is needed to show that the rescaling does not inadvertently alter the relative learning rates across layers in a way that changes the optimization trajectory for non-convex PINN losses.
[§5.2] (BatchNorm-free conv-net experiments): the claim that StableGrad prevents collapse on ResNet/EfficientNet without any other change is load-bearing for the broader applicability argument, yet the section reports only final accuracy and does not include gradient-norm histograms or training-curve statistics across multiple random seeds to rule out post-hoc hyper-parameter tuning.
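The first major comment turns on how per-layer rescaling changes the update direction. The small check below uses made-up gradients and made-up positive factors (both hypothetical). It shows that the direction within each layer is preserved, while the direction of the full concatenated gradient is not.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

layers = [[1e-4, -2e-4], [0.3, 0.1]]   # hypothetical per-layer gradients
factors = [1000.0, 1.0]                # hypothetical positive per-layer factors
scaled = [[f * v for v in g] for f, g in zip(factors, layers)]

# Direction inside each layer: unchanged by a positive scalar.
per_layer_cos = [cosine(g, s) for g, s in zip(layers, scaled)]

# Direction of the full parameter-space gradient: changed.
flat = [v for g in layers for v in g]
flat_scaled = [v for g in scaled for v in g]
global_cos = cosine(flat, flat_scaled)
```

Here `per_layer_cos` is 1 for both layers, while `global_cos` is roughly 0.82. Because every factor is positive, the inner product between the raw and rescaled gradients is a positive-weighted sum of squared layer norms, so the rescaled step is still a first-order descent direction. How far the resulting trajectory departs from the unscaled one is exactly what a sensitivity study would have to measure.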

minor comments (3)

[Eq. (3)] The notation for the per-layer scaling factor uses an ambiguous norm symbol; please specify whether it is the Euclidean or Frobenius norm and whether any small epsilon is added for numerical stability.
[Figure 2] The gradient-norm evolution plot would be clearer if error bands from at least three independent runs were added and if the y-axis were log-scaled to highlight the stabilization effect.
[§3] The abstract states that the physical residual remains unchanged, but the main text does not explicitly restate this invariance when discussing the PINN loss; a short reminder paragraph would improve readability for the target audience.
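On the norm question in the first minor comment, a short numerical note may help. The values are illustrative and the helper names are hypothetical. For a weight-gradient matrix, the Frobenius norm equals the Euclidean norm of the flattened entries, so that distinction is notational. Whether the statistic is divided by the number of entries (an RMS) and how ε enters the denominator are the choices that change the numbers.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared matrix entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def euclidean_norm(vector):
    """Standard L2 norm of a flat vector."""
    return math.sqrt(sum(v * v for v in vector))

def rms(vector):
    """Euclidean norm divided by sqrt(number of entries)."""
    return euclidean_norm(vector) / math.sqrt(len(vector))

W_grad = [[3.0, 0.0], [0.0, 4.0]]           # illustrative 2x2 weight gradient
flat = [v for row in W_grad for v in row]

fro = frobenius_norm(W_grad)   # 5.0
l2 = euclidean_norm(flat)      # 5.0, identical to the Frobenius norm
per_entry = rms(flat)          # 2.5, depends on the layer's size

def scale_factor(target, sigma, eps):
    """Factor of the form target / (sigma + eps)."""
    return target / (sigma + eps)

# With a vanishing gradient statistic, eps alone bounds the factor (about 1e8 here).
dead_layer = scale_factor(1.0, 0.0, 1e-8)
```

Without ε the last call would divide by zero; with it, the factor stays finite but can be very large, which is one reason the exact value of ε is worth reporting.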

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We address each major comment below and will incorporate the requested analyses into the revised manuscript to strengthen the presentation.

Referee: [§4] (Analysis of effective dynamics): the derivation of the modified update rule assumes that per-layer gradient norms are the dominant source of scale imbalance; a concrete counter-example or sensitivity study is needed to show that the rescaling does not inadvertently alter the relative learning rates across layers in a way that changes the optimization trajectory for non-convex PINN losses.

Authors: We agree that additional verification is warranted. Section 4 derives the effective dynamics under the premise that gradient-norm imbalances dominate scale issues in deep networks, which is consistent with the placement of the correction after backpropagation. To directly address concerns about non-convex PINN losses, we will add a sensitivity study in the revision. This will include both a brief theoretical note on how the per-layer rescaling preserves the sign and relative direction of updates while equalizing magnitudes, and empirical results on a representative non-convex PINN benchmark showing that the optimization trajectory remains stable without introducing adverse changes to relative learning rates across layers.

Revision: yes

Referee: [§5.2] (BatchNorm-free conv-net experiments): the claim that StableGrad prevents collapse on ResNet/EfficientNet without any other change is load-bearing for the broader applicability argument, yet the section reports only final accuracy and does not include gradient-norm histograms or training-curve statistics across multiple random seeds to rule out post-hoc hyper-parameter tuning.

Authors: We acknowledge that the current §5.2 focuses on final accuracy metrics. To provide stronger evidence that stabilization occurs consistently and is not the result of post-hoc tuning, we will revise the section to include gradient-norm histograms throughout training and training curves reporting mean and standard deviation across at least five random seeds for both ResNet and EfficientNet architectures. These additions will directly support the claim that StableGrad enables stable optimization without other architectural modifications.

Revision: yes
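The multi-seed reporting promised in the second response amounts to a per-step mean and standard deviation across runs. A minimal sketch follows, with a hypothetical helper `summarize_curves` and placeholder numbers that are not results from the paper.

```python
from statistics import mean, stdev

def summarize_curves(curves):
    """Per-step mean and sample standard deviation across seeds.

    curves: list of equal-length lists, one training curve per seed.
    Returns a list of (mean, std) tuples, one per training step.
    """
    return [(mean(vals), stdev(vals)) for vals in zip(*curves)]

# Placeholder loss curves for five seeds; illustration only, not paper data.
curves = [
    [1.00, 0.50, 0.25],
    [1.10, 0.55, 0.30],
    [0.90, 0.45, 0.20],
    [1.05, 0.52, 0.27],
    [0.95, 0.48, 0.23],
]
summary = summarize_curves(curves)
```

Reporting the spread alongside the mean is what lets a reader distinguish consistent stabilization from a single favorable run.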

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces StableGrad as an optimizer-level post-backpropagation rescaling of layer-wise weight gradients, applied strictly after the backward pass and before the parameter update. This placement ensures by explicit construction that the forward network output, its autodiff derivatives, and the PINN residual remain identical to the unscaled case, with the claimed invariance following directly from the timing rather than from any fitted parameter or self-referential definition. The subsequent analysis of induced effective dynamics and the empirical results on deep PINNs and BatchNorm-free networks are presented as separate, externally verifiable contributions without reducing to renamed known results, self-citation chains, or uniqueness theorems imported from prior author work. The derivation chain is therefore self-contained and independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the paper introduces no explicit free parameters or new invented entities; it relies on the domain assumption that post-backprop rescaling leaves forward outputs and derivatives unchanged.

axioms (1)

domain assumption: Rescaling gradients after backpropagation leaves the forward pass, network outputs, and derivatives unchanged.
This premise is invoked to ensure physical residuals in PINNs remain unaffected by the scale control.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $\tilde{g}_\ell = \frac{\sigma_{\mathrm{out}}}{\sigma_\ell + \varepsilon}\, g_\ell$ … $K_{\mathrm{SG}} = J P J^{\top}$ … Theorem 1 (Local decrease under StableGrad)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: StableGrad … applied only after backpropagation and before the optimizer update … leaves the forward model … unchanged
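The first quoted passage mentions a kernel of the form K_SG = J P J⊤. One standard way such a kernel arises is sketched below, under assumptions not confirmed by this page: u collects the network outputs at the training points, J is their Jacobian with respect to the parameters θ, training is idealized as gradient flow with rate η, and P is block-diagonal with one non-negative factor c_ℓ per layer. The paper's own definitions of P and of Theorem 1 may differ.

```latex
\begin{aligned}
g &= J^{\top} \nabla_u \mathcal{L}, \qquad J = \frac{\partial u}{\partial \theta}, \\
\tilde{g} &= P\, g, \qquad P = \operatorname{blockdiag}(c_1 I, \dots, c_L I), \quad
  c_\ell = \frac{\sigma_{\mathrm{out}}}{\sigma_\ell + \varepsilon} \ge 0, \\
\dot{\theta} &= -\eta\, \tilde{g}
  \;\Longrightarrow\;
  \dot{u} = J \dot{\theta} = -\eta\, J P J^{\top} \nabla_u \mathcal{L}
          = -\eta\, K_{\mathrm{SG}}\, \nabla_u \mathcal{L}, \\
\frac{d\mathcal{L}}{dt} &= (\nabla_u \mathcal{L})^{\top} \dot{u}
  = -\eta\, (\nabla_u \mathcal{L})^{\top} K_{\mathrm{SG}}\, \nabla_u \mathcal{L} \;\le\; 0 .
\end{aligned}
```

With P = I this reduces to the usual tangent kernel J J⊤. The last line holds because P is positive semidefinite, hence so is K_SG. That is at least consistent with the title of Theorem 1, though the theorem's actual statement and assumptions are not reproduced here.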

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

