SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training

Ekansh Sharma; Mohammed Adnan; Rahul G. Krishnan; Rebekka Burkholz; Rohan Jain; Tom Jacobs; Yani Ioannou

arxiv: 2605.27541 · v1 · pith:NEKSPQJMnew · submitted 2026-05-26 · 💻 cs.LG

Mohammed Adnan, Rohan Jain, Tom Jacobs, Ekansh Sharma, Rahul G. Krishnan, Rebekka Burkholz, Yani Ioannou

Pith reviewed 2026-06-29 18:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords dynamic sparse training · batch normalization · gradient skew · sparse optimizer · ResNet · CIFAR-100 · ImageNet · convergence

The pith

Batch Normalization induces gradient skew that slows dynamic sparse training, which SparseOpt corrects for faster convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates both analytically and empirically that Batch Normalization creates gradient skew in sparse networks during dynamic sparse training. It introduces SparseOpt as a sparsity-aware optimizer to counteract this effect. Experiments on ResNet models show faster convergence and better generalization on CIFAR-100 and ImageNet compared to standard methods. The study provides the first systematic look at how normalization interacts with sparse layers and changing topologies.

Core claim

Batch Normalization adversely affects sparse training, and SparseOpt, a sparsity-aware optimizer, addresses this to achieve consistently faster convergence and improved generalization on ResNet models across CIFAR-100 and ImageNet.

What carries the argument

SparseOpt, a sparsity-aware optimizer that corrects normalization-induced gradient skew in sparse layers during dynamic topology adaptation.
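The skew mechanism itself can be illustrated with a small numerical sketch. This is not SparseOpt and not the authors' code. It only shows the per-neuron $1/\sigma_i$ prefactor that Equation (9) (quoted in the Figure 2 caption) places in front of the gradient. The setup is an assumption made for illustration: unmasked weights all equal 1, inputs are i.i.d. standard normal, and the function names are hypothetical.

```python
import math
import random


def preactivation_std(fan_in, n_samples=2000, seed=0):
    """Empirical std of one neuron's pre-activation x = sum_j w_j * a_j.

    Illustrative assumptions: the `fan_in` unmasked weights all equal 1 and
    the inputs a_j are i.i.d. N(0, 1), so the true std is sqrt(fan_in).
    """
    rng = random.Random(seed)
    xs = [sum(rng.gauss(0.0, 1.0) for _ in range(fan_in))
          for _ in range(n_samples)]
    mean = sum(xs) / n_samples
    var = sum((x - mean) ** 2 for x in xs) / n_samples
    return math.sqrt(var)


def bn_prefactors(fan_ins, eps=1e-5):
    """Per-neuron 1/sigma_i factor that BN places in front of the gradient."""
    return [1.0 / math.sqrt(preactivation_std(k) ** 2 + eps) for k in fan_ins]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Three neurons in one sparse layer keeping 64, 16 and 4 incoming weights.
fan_ins = [64, 16, 4]
prefactors = bn_prefactors(fan_ins)

# A gradient with equal per-neuron components, before and after the prefactor.
grad = [1.0, 1.0, 1.0]
grad_bn = [s * g for s, g in zip(prefactors, grad)]

print("1/sigma_i per neuron:", [round(s, 3) for s in prefactors])
print("cosine(grad, grad_bn):", round(cosine_similarity(grad, grad_bn), 3))
```

Under these assumptions $\sigma_i$ grows like the square root of the number of unmasked weights, so the neuron with 4 connections receives roughly four times the prefactor of the neuron with 64. Because the factors differ across neurons, the scaled gradient points in a different direction from the unscaled one, which is the rotation-plus-scaling effect the paper calls skew.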

If this is right

Dynamic sparse training reaches target accuracy in fewer epochs when the optimizer accounts for normalization effects.
Sparse networks trained with the proposed method generalize better than those using standard optimizers.
Current normalization techniques have inherent limitations when paired with sparse connectivity and dynamic changes.
The interaction between Batch Normalization and sparse layers is a primary bottleneck limiting practical use of dynamic sparse training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other normalization methods such as Layer Normalization if similar skew patterns appear in sparse settings.
Testing on non-vision tasks or alternative sparse training algorithms could show whether the gradient correction generalizes beyond image classification.
If the correction works broadly, dynamic sparse training might close the performance gap with dense training on larger models without extra compute.

Load-bearing premise

The identified gradient skew from Batch Normalization is the dominant cause of slower convergence in dynamic sparse training.

What would settle it

If applying SparseOpt or removing Batch Normalization produces no measurable improvement in convergence speed or accuracy for the tested ResNet models on CIFAR-100 and ImageNet, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.27541 by Ekansh Sharma, Mohammed Adnan, Rahul G. Krishnan, Rebekka Burkholz, Rohan Jain, Tom Jacobs, Yani Ioannou.

**Figure 1.** Batch Normalization causes gradient skew in sparse layers. BN scales gradients based on the variance for neurons in a dense layer (a). However, in a sparse layer with masked weights (dashed lines) (b), this can have the effect of scaling each gradient component differently. This non-uniform scaling effectively skews, i.e. rotates and scales, the gradient for a sparse layer (c). In (b), neuron i = 1 with pr…

**Figure 2.** Theoretical vs. empirical effect of Batch Normalization on gradients. As observed theoretically in Section 3, the gradients of a sparse layer with BN can increase with sparsity, leading to instability during training. Here we show how our analytical scaling matches that of real gradients in Equation (11). Adjacent paper text: This can then be written as $\frac{\partial L}{\partial x_i^{(b)}} = \frac{1}{\sigma_i} C_i^{(b)}$ (9), where $C_i^{(b)}$ is defined as $C_i^{(b)} := \partial$…

**Figure 3.** Train accuracy (top-1) vs. epochs on ImageNet with RigL. As observed, our method significantly improves the training dynamics and convergence of RigL, especially for higher sparsities. [Plot text: x-axis "Training Epochs"; y-axis "Top-1 Test Accuracy"; legend "Our Method", "Baseline"; panel (a) "Sparsity = 0.90".]

**Figure 4.** Test accuracy (top-1) vs. epochs on ImageNet with RigL. Our method converges faster and achieves higher accuracy with fewer training epochs, particularly at higher sparsity levels. With much longer training schedules, both methods converge to similar final accuracy. Detailed results can be found in [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]

**Figure 5.** RigL ITOP rate (Rm) vs. sparsity for ResNet50/ImageNet. The top row shows results when RigL uses the original gradients for mask exploration, while the bottom row uses the corrected gradients. Differences in ITOP indicate that BN, through scaling of gradients, influences which connections are regrown and thus affects mask exploration. Adjacent paper text: 5.2. How does BN affect mask exploration? One key difference between DST a…

**Figure 6.** The 2D (β = 0) and 3D representations of the balance equation for a, γ and β initialized at balance for gradient flow. The a parameter can flip its sign in the case of HAM, while this is not possible for balanced GF. Adjacent paper text: Experimental simulation. We illustrate the consequences of the balance relationship in the presence of scaling. We consider one neuron with multi-dimensional input and a mask. One neuron. We train …

**Figure 7.** One-neuron student-teacher dynamics with η = 0.01; HAM is needed for a sign flip. Under a small learning rate, convergence is similar regardless of scaling. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png]

**Figure 8.** One-neuron student-teacher dynamics with large learning rate η = 0.1; HAM is needed for a sign flip. The scaling allows for faster convergence. Adjacent paper text: … are 200 i.i.d. samples from a Gaussian for each entry, N(0, 1). We train with constant learning rate η ∈ [0.01, 0.1] and train for 10000, 1000 iterations, to ensure convergence. The BN parameters are initialized as $\gamma_0 = 1$ and $\beta_0 = 0$; this together with $a_0^2 = 1$ controls…

**Figure 9.** Multi-neuron student with one dense and one sparse neuron and learning rate η = 0.1. Again HAM is needed for the sign flip, substantiating the balance equation. Rescaling can now lead to learning a different, sparser representation with the dense (redundant) neuron turned off. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png]

**Figure 10.** RigL ITOP rate (Rm) vs. sparsity for ResNet20×{1}/CIFAR-100. The top row shows results when RigL uses the original gradients for mask exploration, while the bottom row uses the corrected gradients. Differences in ITOP indicate that BN, through scaling of gradients, influences which connections are regrown and thus affects mask exploration. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png]

**Figure 11.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 with RigL. Our method consistently outperforms the baseline across all sparsity levels, achieving higher generalization accuracy and demonstrating improved rate of convergence. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png]

**Figure 12.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 with SET. Our method consistently outperforms the baseline across all sparsity levels, achieving higher generalization accuracy and demonstrating improved rate of convergence. [Plot text: x-axis "Total Training Epochs"; y-axis "Test Accuracy"; legend "Our Method", "Baseline"; panel (a) "Sparsity = 0.90".]

**Figure 13.** Test accuracy (top-1) vs. total training epochs for ResNet50/ImageNet with RigL. Our method consistently outperforms the baseline across all sparsity levels, achieving higher generalization accuracy and demonstrating improved rate of convergence. [Plot text: x-axis "Total Training Epochs"; y-axis "Test Accuracy"; legend "Our Method", "Baseline"; panel (a) "Sparsity = 0.90".]

**Figure 14.** Test accuracy (top-1) vs. total training epochs for ResNet50/ImageNet with SET. Our method mostly outperforms the baseline across all sparsity levels, achieving higher generalization accuracy and demonstrating improved rate of convergence. [PITH_FULL_IMAGE:figures/full_fig_p019_14.png]

**Figure 15.** Train accuracy (top-1) vs. epochs on ImageNet with SET. As observed, our method significantly improves the training dynamics and convergence of SET, especially for higher sparsities. [Plot text: x-axis "Training Epochs"; y-axis "Top-1 Test Accuracy"; legend "Our Method", "Baseline"; panel (a) "Sparsity = 0.90".]

**Figure 16.** Test accuracy (top-1) vs. epochs on ImageNet with SET. Our method converges faster and achieves higher accuracy with fewer training epochs, particularly at higher sparsity levels. With much longer training schedules, both methods converge to similar final accuracy. [PITH_FULL_IMAGE:figures/full_fig_p020_16.png]

**Figure 17.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 with RigL. We evaluate the compatibility of our method with HAM optimization, demonstrating improved rate of convergence across increasing sparsity levels. [Plot text: x-axis "Total Training Epochs"; y-axis "Test Accuracy"; legend "Our Method w/ HAM", "Our Method w/o HAM"; panel (a) "Sparsity = 0.90".]

**Figure 18.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 using SET. We evaluate the compatibility of our method with HAM optimization, demonstrating improved rate of convergence across increasing sparsity levels. [PITH_FULL_IMAGE:figures/full_fig_p021_18.png]

**Figure 19.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 with RigL. We compare our proposed sparsity-aware gradient scaling method against the standard RigL baseline with gradient renormalization across increasing sparsity levels. [Plot text: x-axis "Total Training Epochs"; y-axis "Test Accuracy"; legend "Our Method", "Baseline"; panel (a) "Sparsity = 0.90".]

**Figure 20.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 with SET. We compare our proposed sparsity-aware gradient scaling method against the standard SET baseline with gradient renormalization across increasing sparsity levels. [PITH_FULL_IMAGE:figures/full_fig_p025_20.png]

**Figure 21.** Test accuracy (top-1) vs. total training epochs for ResNet50/ImageNet with RigL. We compare our proposed sparsity-aware gradient scaling method against the standard RigL baseline with gradient renormalization across increasing sparsity levels, to analyze only the effect of gradient direction. As shown, training with our method improves convergence rate, i.e. models achieve better generalization with less trai…

**Figure 22.** Test accuracy (top-1) vs. total training epochs for ResNet50/ImageNet with SET. We compare our proposed sparsity-aware gradient scaling method against the standard SET baseline with gradient renormalization across increasing sparsity levels, to analyze only the effect of gradient direction. As shown, training with our method improves convergence rate, i.e. models achieve better generalization with less traini…

**Figure 23.** Test accuracy (top-1) vs. epochs on ImageNet with RigL. Our method converges faster and achieves higher accuracy with fewer training epochs, particularly at higher sparsity levels. [Plot text: x-axis "Training Epochs"; y-axis "Top-1 Train Accuracy"; legend "Our Method", "Baseline"; panel (a) "Sparsity = 0.90".]

**Figure 24.** Train accuracy (top-1) vs. epochs on ImageNet with RigL. As observed, our method significantly improves the training dynamics and convergence of RigL, especially for higher sparsities. [PITH_FULL_IMAGE:figures/full_fig_p029_24.png]

**Figure 25.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 with RigL. We evaluate compatibility of our method with HAM optimization, demonstrating improved rate of convergence across increasing sparsity levels. [Plot text: x-axis "Total Training Epochs"; y-axis "Test Accuracy"; legend "Our Method w/ HAM", "Our Method w/o HAM"; panel (a) "Sparsity = 0.90".]

**Figure 26.** Test accuracy vs. total training epochs for ResNet20×{1}/CIFAR-100 with SET. We evaluate compatibility of our method with HAM optimization, demonstrating improved rate of convergence across increasing sparsity levels. [PITH_FULL_IMAGE:figures/full_fig_p030_26.png]

Original abstract

Dynamic Sparse Training (DST) methods train neural networks by maintaining sparsity while dynamically adapting the network topology. Despite the promise of reduced computation, DST methods converge significantly slower than dense training, often requiring comparable training time to achieve similar accuracy. We demonstrate both analytically and empirically that Batch Normalization (BN) adversely affects sparse training, and propose SparseOpt, a sparsity-aware optimizer, to address this. Experiments on ResNet models across CIFAR-100 and ImageNet demonstrate consistently faster convergence and improved generalization with our proposed method. Our work highlights the limitations of current normalization layers in sparse training and provides the first systematic study of the interaction between Batch Normalization, sparse layers, and DST, taking a significant step toward making DST practically competitive with dense training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SparseOpt targets BN gradient skew in DST with a new optimizer and claims faster convergence on ResNets, but the abstract gives no equations or controls so the core claim stays unverified.

The paper's main move is to flag Batch Normalization as a source of gradient skew that slows dynamic sparse training, then introduce SparseOpt as a sparsity-aware optimizer to correct it. The authors position this as the first systematic look at the interaction between BN, sparse layers, and DST, and back it with experiments on ResNet models for CIFAR-100 and ImageNet that show quicker convergence and better generalization.

What stands out as new is the explicit focus on the BN-DST interaction plus the optimizer tweak itself. If the analytical demonstration actually isolates the skew mechanism and the empirical gains hold with proper baselines, this could help people trying to close the practicality gap in sparse training.

The experiments target standard architectures and datasets, which at least keeps the work grounded in real training setups rather than toy cases.

The soft spots are clear from the abstract alone. No equations appear, the analytical argument is not described, and there is no mention of error bars, statistical tests, or ablations that separate BN skew from other DST factors such as topology updates or mask-induced variance. The concern is substantive: without those controls it is hard to know whether BN is the dominant bottleneck, or whether SparseOpt's gains would transfer beyond ResNets on image classification. The claim that the method makes DST competitive therefore rests on evidence that cannot be inspected here.

This is for readers already working on dynamic sparse training or efficient optimizers. A serious referee should see it if the full paper supplies the missing derivations, ablations, and statistical details, because any optimizer that reliably narrows the DST-to-dense gap is worth checking even if revisions are needed.

Referee Report

2 major / 1 minor

Summary. The paper claims that Batch Normalization induces gradient skew that slows Dynamic Sparse Training (DST) convergence relative to dense training. It provides both an analytical demonstration of this effect and a sparsity-aware optimizer (SparseOpt) to correct it, with experiments on ResNet models showing faster convergence and better generalization on CIFAR-100 and ImageNet.

Significance. If the central claims hold, the work would be significant for identifying a previously under-studied interaction between normalization layers and sparsity, and for offering a concrete optimizer fix that could help close the convergence gap between DST and dense training. The positioning as the first systematic study of BN-sparse-DST interactions adds to its potential impact if the evidence is made inspectable.

major comments (2)

[Abstract] Abstract: the claim of an 'analytical demonstration' is unsupported because the abstract (and thus the central claim) contains no equations, no derivation outline, and no description of how the gradient skew is formally shown; without this the analytical part of the contribution cannot be evaluated.
[Abstract] Abstract / experimental claims: no baseline comparisons, error bars, or statistical details are mentioned, and no controls are described that isolate BN-induced skew from other DST factors such as topology updates or mask-induced variance; this leaves open whether the reported gains are attributable to the proposed correction or to other unstated differences.

minor comments (1)

[Abstract] The abstract does not name the specific ResNet variants, sparsity levels, or DST baselines used, which hinders immediate assessment of scope and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The abstract is intended as a concise overview, with full analytical derivations and experimental details provided in the body of the paper. We address each point below and will revise the abstract to better signal the location and nature of the supporting evidence.

Referee: [Abstract] Abstract: the claim of an 'analytical demonstration' is unsupported because the abstract (and thus the central claim) contains no equations, no derivation outline, and no description of how the gradient skew is formally shown; without this the analytical part of the contribution cannot be evaluated.

Authors: We agree the abstract does not contain equations or a derivation outline, as is conventional for abstracts. The analytical demonstration of BN-induced gradient skew appears in Section 3, including the formal derivation of how normalization layers produce skewed gradients under dynamic sparsity. We will revise the abstract to explicitly state that the analytical demonstration is detailed in Section 3, thereby making the claim traceable without lengthening the abstract excessively. (Revision: yes.)
Referee: [Abstract] Abstract / experimental claims: no baseline comparisons, error bars, or statistical details are mentioned, and no controls are described that isolate BN-induced skew from other DST factors such as topology updates or mask-induced variance; this leaves open whether the reported gains are attributable to the proposed correction or to other unstated differences.

Authors: Abstract length constraints preclude inclusion of error bars, full baseline tables, or control descriptions. The manuscript reports comparisons against standard DST optimizers (Section 4), multiple random seeds with error bars (Section 5), and ablation studies that isolate the BN-skew correction from topology updates and mask variance (Section 5.3). We will revise the abstract to note that experiments include standard baselines and BN-specific controls, directing readers to the relevant sections for details. (Revision: yes.)

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained

The provided abstract and claims describe an analytical demonstration plus empirical results on ResNet models for CIFAR-100 and ImageNet, with no equations, fitted parameters, or self-citations shown that reduce any prediction to its own inputs by construction. The central premise (BN-induced gradient skew affecting DST) is presented as independently verified by analysis and experiments rather than defined circularly or imported via author self-citation chains. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.1-grok · 5668 in / 1003 out tokens · 30542 ms · 2026-06-29T18:38:45.876989+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 2 canonical work pages

[1]

IEEE Conference on, pp. 248–255. IEEE, 2009. URL https://ieeexplore.ieee.org/abstract/document/5206848/. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011. Evci, U., Elsen, E., Castro, P., and Gale, T. Rigging the lottery: Making all ticket...

work page doi:10.1109/cvpr.2016.90 2009
[2]

org/CorpusID:17043130

URL https://api.semanticscholar.org/CorpusID:17043130. Li, X.-L. Preconditioned stochastic gradient descent. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, May 2018. ISSN 2162-2388. doi: 10.1109/tnnls.2017.2672978. URL http://dx.doi.org/10.1109/TNNLS.2017.2672978. Liu, S., Yin, L., Mocanu, D. C., and Pechenizkiy, M. Do we act...

work page doi:10.1109/tnnls.2017.2672978 2018
[3]

This is summarized by the following observation

Clearly, if we now apply a mask to $w$ and rescale, this does not affect the invariant. This is summarized by the following observation. Observation A.2. The scaling and mask only change the gradient flow of $w_t$. Therefore, it does not affect the invariance in Lemma A.1. HAM gradient flow. Consider now the HAM gradient flow which is a Riemannian gradient flow with ...

2026
[4]

In Figures 6a and 6b we illustrate the invariance for a balanced initialization

making it possible to recover the ground truth (Gadhikar et al., 2025). In Figures 6a and 6b we illustrate the invariance for a balanced initialization. Note that the invariant for gradient flow becomes singular at $a = 0$, while HAM's invariant does not. Remark A.5. These balance equations can be easily extended to the multi-neuron case. We can see this from the ch...

2025
[5]

It is calculated relative to a base batch size of 256: $\eta_{\text{peak}} = \eta_{\text{base}} \times \frac{B}{256}$, where $\eta_{\text{base}} = 0.1$ is the base learning rate provided in the arguments

Peak Learning Rate Scaling. The peak learning rate ($\eta_{\text{peak}}$) is dynamically scaled based on the global batch size ($B$) to ensure consistent convergence across different hardware configurations. It is calculated relative to a base batch size of 256: $\eta_{\text{peak}} = \eta_{\text{base}} \times \frac{B}{256}$, where $\eta_{\text{base}} = 0.1$ is the base learning rate provided in the arguments.
[6]

Warmup Phase. Training begins with a linear warmup phase lasting 5 epochs ($T_{\text{warmup}}$). During this period, the learning rate increases linearly from a small initial value ($\eta_{\text{init}}$) to the peak learning rate: $\eta_t = \mathrm{Linear}(\eta_{\text{init}}, \eta_{\text{peak}}, t)$ for $0 \le t < T_{\text{warmup}}$, where $\eta_{\text{init}} = 1 \times 10^{-5}$.
[7]

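Cosine Decay Phase. After the warmup phase, the learning rate follows a standard cosine decay schedule for the remainder of the training duration ($T_{\text{total}} - T_{\text{warmup}}$). The learning rate decays from $\eta_{\text{peak}}$ down to a final minimum value ($\eta_{\text{end}}$) by the last epoch: $\eta_t = \eta_{\text{end}} + \frac{1}{2}(\eta_{\text{peak}} - \eta_{\text{end}})\left(1 + \cos\left(\frac{t - T_{\text{warmup}}}{T_{\text{total}} - T_{\text{warmup}}}\,\pi\right)\right)$, where $\eta_{\text{end}} = 1 \times 10^{-5}$ and $T_{\text{total}}$ is th...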
[8]

Warmup Phase. Training initiates with a linear warmup phase for the first 5 epochs ($T_{\text{warmup}}$). The learning rate increases linearly from 0 to the base learning rate ($\eta_{\text{base}}$): $\eta_t = \eta_{\text{base}} \times \frac{t}{T_{\text{warmup}}}$ for $0 \le t < T_{\text{warmup}}$, where $\eta_{\text{base}}$ is the learning rate provided in the arguments (typically 0.1).
[9]

gradient clipping

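Cosine Decay Phase. Following the warmup, the learning rate follows a standard cosine annealing schedule for the remaining epochs ($T_{\text{total}} - T_{\text{warmup}}$). The learning rate decays from $\eta_{\text{base}}$ to a final minimum value ($\eta_{\text{end}}$): $\eta_t = \eta_{\text{end}} + \frac{1}{2}(\eta_{\text{base}} - \eta_{\text{end}})\left(1 + \cos\left(\frac{t - T_{\text{warmup}}}{T_{\text{total}} - T_{\text{warmup}}}\,\pi\right)\right)$, where $\eta_{\text{end}} = 1 \times 10^{-6}$ and $T_{\text{total}}$ is the total number of training epochs (e.g.,...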
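The two learning-rate schedules quoted in entries [5] to [9] share one shape: batch-size scaling of the peak rate, a linear warmup, then cosine decay. A minimal sketch of that shape follows. The function names, the reading of `Linear(...)` as plain linear interpolation in $t / T_{\text{warmup}}$, and the choice to measure `t` in (possibly fractional) epochs are assumptions, since the excerpts do not specify them.

```python
import math


def peak_lr(base_lr=0.1, global_batch=256, base_batch=256):
    """Scale the base learning rate linearly with the global batch size."""
    return base_lr * global_batch / base_batch


def lr_at(t, peak, total, warmup=5, init=1e-5, end=1e-5):
    """Learning rate at time t (in epochs) for warmup followed by cosine decay.

    Assumption: the warmup interpolates linearly from `init` to `peak`.
    """
    if t < warmup:
        return init + (peak - init) * t / warmup
    progress = (t - warmup) / (total - warmup)
    return end + 0.5 * (peak - end) * (1.0 + math.cos(math.pi * progress))


# First schedule: peak scaled by batch size, warmup from 1e-5, decay to 1e-5.
# The batch size of 1024 and the 100-epoch budget are illustrative values.
peak = peak_lr(base_lr=0.1, global_batch=1024)
schedule_a = [lr_at(t, peak=peak, total=100) for t in range(101)]

# Second schedule: warmup from 0 to the base rate, decay to 1e-6.
schedule_b = [lr_at(t, peak=0.1, total=100, init=0.0, end=1e-6)
              for t in range(101)]
```

Both excerpts reduce to the same function with different `init`, `peak`, and `end` values, which is why a single parameterized helper is used here rather than two separate ones.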
