Robust and Fast Training via Per-Sample Clipping

Davide Nobile; Philipp Grohs

arXiv: 2605.02701 · v2 · pith:HPZ2TYLUnew · submitted 2026-05-04 · 🧮 math.OC · cs.LG · stat.ML

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · stat.ML

keywords: per-sample clipping, stochastic gradient descent, heavy-tailed noise, non-convex optimization, convergence rates, robust training, gradient clipping, deep neural networks

The pith

Per-sample gradient clipping in SGD achieves optimal convergence rates for non-convex problems under heavy-tailed noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops per-sample clipped SGD, a stochastic gradient method that clips the gradient contribution of each individual training sample before averaging within a mini-batch. It establishes that this estimator delivers the fastest known convergence rates in expectation for non-convex optimization when gradients exhibit heavy tails, and supplies matching high-probability bounds that lose only polylog factors in the failure probability. These guarantees matter because heavy-tailed gradient noise appears routinely in deep-network training, where standard SGD or batch-level clipping can converge more slowly or less reliably. The authors also report empirical gains over momentum SGD and conventional clipping when training AlexNet on CIFAR-100, even after the extra per-sample cost, and observe that clipping during gradient accumulation improves performance at negligible extra expense.

Core claim

By replacing the usual averaged gradient with a per-sample clipped version, the resulting PS-Clip-SGD algorithm attains optimal in-expectation convergence rates for non-convex stochastic optimization under heavy-tailed gradient noise and yields high-probability convergence guarantees that match those rates up to polylogarithmic factors in the failure probability.

What carries the argument

Per-sample clipped gradient estimator, which clips each sample's gradient individually before aggregation to control the influence of heavy-tailed outliers.
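
To make the contrast with batch-level clipping concrete, here is a minimal Python sketch, not the authors' implementation. It assumes gradients are plain lists of floats, uses one fixed threshold for every sample, and applies the rescaling min(1, threshold/|g|)·g; the helper names and the toy numbers are hypothetical.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def clip(v, threshold):
    """Rescale v so its norm is at most threshold: min(1, threshold/|v|) * v."""
    n = l2_norm(v)
    scale = 1.0 if n <= threshold else threshold / n
    return [scale * x for x in v]

def mean(vectors):
    """Coordinate-wise average of equally sized vectors."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def batch_clipped(grads, threshold):
    """Conventional clipping: average first, then clip the mini-batch gradient."""
    return clip(mean(grads), threshold)

def per_sample_clipped(grads, threshold):
    """Per-sample clipping: clip every sample's gradient, then average."""
    return mean([clip(g, threshold) for g in grads])

# Three ordinary samples and one heavy-tailed outlier (made-up numbers).
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 100.0]]
print(batch_clipped(grads, threshold=1.0))
print(per_sample_clipped(grads, threshold=1.0))
```

With these toy numbers the batch-level estimate points almost entirely along the outlier's direction, while the per-sample estimate keeps most of the signal from the three ordinary samples.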

If this is right

Optimal in-expectation convergence rates are obtained for non-convex problems under the stated heavy-tailed noise model.
High-probability convergence bounds hold that differ from the expectation bounds by only polylog factors in the failure probability.
Empirical training of AlexNet on CIFAR-100 improves over both momentum SGD and batch-level clipping, even after accounting for per-sample overhead.
Applying clipping during gradient accumulation steps improves performance at almost zero extra cost, contrary to the usual practice of clipping only after accumulation.
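
The accumulation point in the last item can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's training loop: each accumulation step is assumed to yield one micro-batch gradient as a list of floats, and the function name, flag, and numbers are hypothetical.

```python
import math

def clip(v, threshold):
    """Rescale v so its Euclidean norm is at most threshold."""
    n = math.sqrt(sum(x * x for x in v))
    scale = 1.0 if n <= threshold else threshold / n
    return [scale * x for x in v]

def accumulate(micro_grads, threshold, clip_each_step):
    """Average micro-batch gradients over the accumulation steps.

    clip_each_step=True clips every micro-batch gradient before adding it.
    clip_each_step=False follows common practice: clip once, after accumulation.
    """
    total = [0.0] * len(micro_grads[0])
    for g in micro_grads:
        if clip_each_step:
            g = clip(g, threshold)
        total = [a + b for a, b in zip(total, g)]
    avg = [a / len(micro_grads) for a in total]
    return avg if clip_each_step else clip(avg, threshold)

# Four accumulation steps; the third carries an outlier (made-up numbers).
micro_grads = [[0.5, 0.0], [0.5, 0.0], [0.0, 50.0], [0.5, 0.0]]
print(accumulate(micro_grads, threshold=1.0, clip_each_step=False))
print(accumulate(micro_grads, threshold=1.0, clip_each_step=True))
```

Clipping inside the loop adds one norm computation per accumulation step on a gradient that is already in memory, which is consistent with the claim that the extra cost is small.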

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may allow larger batch sizes in practice without sacrificing stability when tails are heavy.
It suggests that the timing of clipping relative to accumulation steps deserves systematic study across optimizers.
If the per-sample clipping cost can be amortized, the approach could extend naturally to other first-order methods that suffer from outlier gradients.

Load-bearing premise

The noise in the observed gradients must follow a heavy-tailed distribution possessing finite moments of a prescribed order.
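
A small Python sketch of what "finite moments of a prescribed order" can look like in practice. The Pareto distribution is an illustrative stand-in chosen here, not the paper's noise model, and the shape value and function names are hypothetical. It relies on the standard fact that a Pareto variable with scale 1 and shape a has E[X^p] = a/(a − p) for p < a and an infinite p-th moment otherwise.

```python
import math
import random

def pareto_moment(shape, p):
    """E[X^p] for a Pareto variable with scale 1 and the given shape.

    The moment equals shape / (shape - p) when p < shape and is infinite otherwise.
    """
    return shape / (shape - p) if p < shape else math.inf

def symmetric_pareto_noise(rng, shape, n):
    """Draw n sign-symmetric heavy-tailed samples: random sign times a Pareto draw."""
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(shape) for _ in range(n)]

shape = 1.5                          # hypothetical tail index
print(pareto_moment(shape, 1.2))     # finite moment of order p = 1.2
print(pareto_moment(shape, 2.0))     # infinite: the variance does not exist
rng = random.Random(0)
noise = symmetric_pareto_noise(rng, shape, 10_000)
print(max(abs(z) for z in noise))    # largest magnitude in the sample
```

In this toy setting a moment of order p = 1.2 exists while the second moment does not, which is the regime where variance-based analyses of SGD no longer apply.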

What would settle it

On a synthetic non-convex problem whose gradient noise is independently verified to be light-tailed, PS-Clip-SGD would fail to match the claimed optimal rates and would perform no better than unclipped SGD.

Figures

Figures reproduced from arXiv: 2605.02701 by Davide Nobile, Philipp Grohs.

**Figure 1.** Performance of Normalized-SGD, Clip-SGD and PS-Clip-SGD for different noise regimes.

**Figure 2.** (1 − δ)-quantile of the average gradient norm after T = 100 training steps, plotted against log(1/δ) for the three algorithms and different noise regimes. As before, due to the choice of parameters, Normalized-SGD and Clip-SGD are indistinguishable in the plot.

**Figure 3.** Training and validation accuracies of SGD, Clip-SGD and PS-Clip-SGD, all with momen…

**Figure 4.** Performance of Normalized-SGD, Clip-SGD and PS-Clip-SGD for different noise regimes.

**Figure 5.** (1 − δ)-quantile of the average gradient norm after T = 100 training steps, plotted against log(1/δ) for the three algorithms and different noise regimes. The experiment is performed using the tuned hyperparameters from…

**Figure 6.** Blue line, left y-axis: average per-sample gradient norm in each epoch, $\frac{1}{n_{\text{batches}}} \sum_{t=1}^{n_{\text{batches}}} \frac{1}{\text{batch\_size}} \sum_{i=1}^{\text{batch\_size}} |\nabla f(x_t, \xi_t^{(i)})|$. Green line, right y-axis: average number of clipped gradients in a batch for PS-Clip-SGD, $\frac{1}{n_{\text{batches}}} \sum_{t=1}^{n_{\text{batches}}} \frac{1}{\text{batch\_size}} \sum_{i=1}^{\text{batch\_size}} n^{(t,i)}_{\text{clipped}}$. Note: $n_{\text{batches}}$ indicates the total number of batches in an epoch, while $n^{(t,i)}_{\text{clipped}}$ indicates the n…

Original abstract

We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation, applying clipping at the mini-batch level can improve training performance while incurring virtually no additional computational cost. This finding is particularly interesting, as it contradicts the common practice of applying clipping only after all accumulation steps have been completed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Per-sample clipping delivers optimal in-expectation rates plus near-matching high-probability bounds for heavy-tailed non-convex SGD, with experiments showing practical gains on CIFAR even after compute overhead.

The core advance here is a per-sample clipped SGD estimator that attains optimal convergence rates in expectation for non-convex objectives under heavy-tailed noise, plus high-probability bounds that lose only polylog factors in the failure probability. That combination looks like the real novelty beyond standard clipping results. The paper also reports that the method beats both momentum SGD and batch-level clipping on AlexNet/CIFAR-100, and it measures the extra wall-clock cost of the per-sample operations, so the comparison accounts for the overhead. A secondary empirical note, that clipping inside the accumulation loop can help without much added cost, pushes against the usual post-accumulation practice and might matter for large-batch or distributed runs. The theory rests on standard moment assumptions on the noise, which keeps the claims conditional but internally consistent. Experiments are narrow (one model, one dataset), so the speedups are suggestive rather than definitive, but the controls for compute time are a plus. No load-bearing gaps appear in the stated scope. This is useful reading for people working on robust non-convex optimization or heavy-tailed training pipelines. It is worth sending to referees because the rates are explicit, the estimator is simple to code, and the accumulation observation is easy to check.

Referee Report

1 major / 2 minor

Summary. The paper proposes per-sample clipped SGD (PS-Clip-SGD) as a robust gradient estimator. It claims that this method achieves optimal in-expectation convergence rates for non-convex optimization under heavy-tailed gradient noise, along with high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. These theoretical results are supported by experiments demonstrating that PS-Clip-SGD outperforms vanilla SGD with momentum and standard gradient clipping when training AlexNet on CIFAR-100 (accounting for extra compute), and that mini-batch-level clipping during gradient accumulation can improve performance at negligible cost, contrary to common practice.

Significance. If the stated convergence results hold, the work provides a theoretically grounded clipping strategy with optimal rates under heavy-tailed noise assumptions that are relevant to deep learning. The matching high-probability bounds and the empirical observation on accumulation-stage clipping are practical strengths. The paper supplies conditional optimality claims and reproducible-style experiments as supporting elements.

major comments (1)

[Theoretical analysis] The central optimality claim in the abstract rests on the gradient noise satisfying explicit heavy-tailed moment bounds. The manuscript should include a dedicated subsection (likely in the theoretical analysis) that states the precise moment conditions (e.g., which p-moments are finite) and shows how they yield the claimed optimal rate; without this explicit linkage the applicability to the reported AlexNet/CIFAR-100 runs remains conditional rather than verified.

minor comments (2)

[Abstract] The abstract states that multiple numerical experiments were performed, yet only the AlexNet/CIFAR-100 run is described in detail; a one-sentence summary of the other experiments would improve completeness.
[Experiments] In the experimental section, the comparison to baselines should explicitly state whether hyper-parameters for vanilla SGD and standard clipping were re-tuned on the same compute budget as PS-Clip-SGD; the current description leaves open the possibility that the reported gains partly reflect unequal tuning effort.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the minor revision recommendation. We address the single major comment below and will update the manuscript accordingly.

Referee: [Theoretical analysis] The central optimality claim in the abstract rests on the gradient noise satisfying explicit heavy-tailed moment bounds. The manuscript should include a dedicated subsection (likely in the theoretical analysis) that states the precise moment conditions (e.g., which p-moments are finite) and shows how they yield the claimed optimal rate; without this explicit linkage the applicability to the reported AlexNet/CIFAR-100 runs remains conditional rather than verified.

Authors: We agree that a dedicated subsection would improve clarity and make the optimality claims self-contained. In the revised manuscript we will insert a new subsection (tentatively titled 'Moment Assumptions and Derivation of Optimal Rates') immediately after the problem setup in the theoretical analysis section. This subsection will (i) state the precise assumption that the stochastic gradient noise satisfies E[||noise||^p] ≤ σ^p for some p ∈ (1,2] and all samples, (ii) recall the standard heavy-tailed convergence result that yields the optimal in-expectation rate O(T^{-(p-1)/(3p-2)}) for the gradient norm (or the specific rate proved in our theorems), and (iii) explicitly connect these conditions to the high-probability bounds. We will also add a short paragraph discussing why the CIFAR-100 experiments are consistent with the assumed regime. These additions require only a few paragraphs and do not alter any proofs or experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes per-sample clipped SGD (PS-Clip-SGD) and claims optimal in-expectation and high-probability convergence rates for non-convex problems under heavy-tailed gradient noise with moment bounds. These rates are derived from standard external optimization theory (e.g., typical SGD analyses adapted to clipping and tail assumptions) rather than reducing to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations within the paper. The abstract and context indicate the results are conditional on the stated noise model, with experiments as separate empirical validation. No self-definitional steps, ansatzes smuggled via citation, or renaming of known results as new derivations are present. The central claims remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard non-convex smoothness assumptions and a heavy-tailed noise model whose precise parameters are not enumerated in the abstract.

axioms (2)

domain assumption: Gradient noise is heavy-tailed with bounded moments sufficient for the clipping analysis.
Invoked to obtain optimal rates; if violated, rates no longer hold.
standard math: The objective is L-smooth and bounded below.
Standard assumption in non-convex SGD analysis.
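
Written out, the two ledger entries correspond to the following standard conditions. This is a sketch in conventional notation: the symbols L, f*, σ and p are assumptions here, the moment bound follows the form quoted in the simulated rebuttal, and the paper's exact statement may differ.

```latex
\begin{align*}
\text{(bounded below)}\quad & f^{*} := \inf_{x} f(x) > -\infty, \\
\text{($L$-smooth)}\quad & \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y, \\
\text{(heavy-tailed noise)}\quad & \mathbb{E}\,\|\nabla f(x,\xi) - \nabla f(x)\|^{p} \le \sigma^{p} \quad \text{for some } p \in (1, 2].
\end{align*}
```

For p = 2 the last condition reduces to the usual bounded-variance assumption; for p < 2 the variance may be infinite.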

pith-pipeline@v0.9.0 · 5470 in / 1275 out tokens · 43116 ms · 2026-05-08T17:46:07.678858+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost (J(x)=½(x+x⁻¹)−1) washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

g(x_t,ξ_t)=min(1, γ_t/|∇f(x_t,ξ_t)|)∇f(x_t,ξ_t)... clipping factors γ_k^(t):=min(1, αk^{1/β}/|∇f(x_t,ξ_t^{(k)})|)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Is Variational Monte Carlo Robust? Sharp Moment Thresholds and Heavy-tailed Stochastic Optimization
cs.LG · 2026-06 · unverdicted · novelty 7.0

VMC local energy and gradient estimators are generically heavy-tailed for common ansatze due to nodal sets, but a new clipped variant converges in the low-moment regime.