Stochastic Rounding Increases Small Singular Values
Pith reviewed 2026-06-28 21:00 UTC · model grok-4.3
The pith
Stochastic rounding increases clusters of small singular values even for matrices with constant aspect ratios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stochastic rounding increases the smallest singular value of extremely tall-and-thin matrices, but this regularization effect holds for matrices with constant aspect ratio and moreover lifts entire clusters of singular values at the tail of the spectrum rather than only the minimal value.
What carries the argument
Stochastic rounding applied to matrix entries, which perturbs the singular value spectrum by lifting the tail cluster.
If this is right
- The regularization benefit of stochastic rounding applies to the wider class of matrices with bounded aspect ratio that arise in most applications.
- Stochastic rounding modifies not only the extremal singular value but a whole tail segment of the spectrum.
- Low-precision computations can therefore be expected to gain stability across a larger set of problem dimensions.
- The spectral effect supplies a uniform explanation for observed regularization in both extreme and moderate aspect-ratio regimes.
Where Pith is reading between the lines
- The same rounding mechanism may improve conditioning in iterative solvers that operate on square or mildly rectangular matrices.
- Similar tail-lifting behavior could appear under other quantization schemes once the aspect-ratio restriction is removed.
- Numerical experiments on fixed-aspect-ratio test matrices at varying bit widths would directly test the cluster-lifting prediction.
- If the effect scales with matrix size, stochastic rounding might serve as a built-in preconditioner for constant-aspect problems without extra computation.
Load-bearing premise
The stochastic rounding error model together with the random matrix ensembles used permits an analytic proof of the lifting effect for constant aspect ratios.
What would settle it
Generate a random matrix with fixed aspect ratio such as 2-to-1, apply stochastic rounding to its entries at a chosen precision, recompute its singular values, and check whether the lower cluster rises relative to the unrounded matrix.
Figures
read the original abstract
Over the past half-dozen years, stochastic rounding (SR) has regained significant attention as a quantization scheme for low-precision floating-point arithmetic, with applications spanning numerical analysis and modern machine learning systems. Recent work has shown that SR acts as an implicit regularizer by increasing the smallest singular value of extremely tall-and-thin (or, symmetrically, short-and-fat) matrices. In this work, we substantially sharpen and extend this understanding in two directions. First, we show that the regularization effect of SR is not restricted to extreme aspect ratio regimes: it persists for matrices with constant aspect ratio. Second, we demonstrate that SR does not merely regularize the smallest singular value, but instead lifts entire clusters of singular values at the tail of the spectrum. Together, these results provide a more general characterization of stochastic rounding as a spectral regularizer, revealing that its effects extend beyond extremal aspect ratios and act on a broader portion of the singular value spectrum.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that stochastic rounding (SR) acts as an implicit spectral regularizer whose effects are not limited to extreme aspect-ratio regimes: the regularization persists for matrices with fixed constant aspect ratio m/n = c. It further claims that SR lifts entire clusters of small singular values at the tail of the spectrum rather than only the smallest singular value.
Significance. If the central claims hold, the work supplies a broader analytic characterization of SR as a regularizer, with direct relevance to stability questions in low-precision linear algebra and machine-learning training. The absence of free parameters or ad-hoc fitted quantities in the derivations is a strength; the extension beyond the extreme-aspect-ratio setting, if rigorously established, would be a genuine advance over prior results.
major comments (2)
- [proof of the constant-aspect-ratio result (likely §3 or Theorem 2)] The load-bearing claim that the regularization effect persists for constant aspect ratio (away from 0 and infinity) requires showing that the additive SR perturbation produces a definite positive shift on the lower edge of the Marchenko-Pastur (or non-Gaussian analogue) bulk density. The manuscript must explicitly verify that the analysis does not reduce to an additional asymptotic regime in which m/n → 0 or ∞; otherwise the constant-ratio statement does not follow.
- [statement and proof of the cluster-lifting result (likely §4)] The second claim—that SR lifts an entire cluster of tail singular values rather than only the minimal one—needs a quantitative statement of the cluster size or the spectral interval that is shifted. Without an explicit bound on the number or location of affected singular values, the distinction from the “smallest singular value only” result remains unclear.
minor comments (2)
- [§2] Notation for the stochastic rounding model (additive perturbation variance tied to machine epsilon) should be introduced once and used consistently; the current presentation mixes several equivalent but non-identical formulations.
- [Figures 2–4] Figure captions should state the matrix ensemble, aspect ratio, and precision explicitly so that the plotted singular-value histograms can be reproduced without consulting the main text.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive report. The two major comments identify places where additional explicitness will strengthen the presentation. We address each point below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [proof of the constant-aspect-ratio result (likely §3 or Theorem 2)] The load-bearing claim that the regularization effect persists for constant aspect ratio (away from 0 and infinity) requires showing that the additive SR perturbation produces a definite positive shift on the lower edge of the Marchenko-Pastur (or non-Gaussian analogue) bulk density. The manuscript must explicitly verify that the analysis does not reduce to an additional asymptotic regime in which m/n → 0 or ∞; otherwise the constant-ratio statement does not follow.
Authors: Theorem 2 is stated and proved for a fixed aspect ratio γ = m/n ∈ (0,∞) bounded away from the extremes; the Marchenko-Pastur (or non-Gaussian) density is parameterized by this fixed γ, and the first-order perturbation calculation yields a strictly positive shift at the lower edge that depends continuously on γ but does not vanish for any such fixed γ. The derivation never invokes an auxiliary limit γ → 0 or γ → ∞. We will insert an explicit sentence in the theorem statement and the surrounding discussion confirming that γ remains fixed throughout the argument. revision: yes
-
Referee: [statement and proof of the cluster-lifting result (likely §4)] The second claim—that SR lifts an entire cluster of tail singular values rather than only the minimal one—needs a quantitative statement of the cluster size or the spectral interval that is shifted. Without an explicit bound on the number or location of affected singular values, the distinction from the “smallest singular value only” result remains unclear.
Authors: The theorem in §4 already supplies explicit constants ε and δ (depending on the SR variance parameter and the matrix dimensions) such that every singular value lying in [0,ε] is increased by at least δ. The number of affected singular values is therefore at most the multiplicity of the interval [0,ε], which is controlled by the same constants. We will revise the theorem statement to display these bounds prominently and add a short remark contrasting the result with the single-smallest-singular-value case treated in prior work. revision: yes
Circularity Check
No circularity; derivation chain self-contained against external benchmarks
full rationale
The provided abstract and context present the core claims as analytical extensions of prior regularization results to constant aspect ratios and tail clusters of singular values, without any exhibited equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the new statements to the inputs by construction. No self-definitional steps, ansatz smuggling, or uniqueness theorems imported from overlapping authors are visible in the text. The Marchenko-Pastur reference and stochastic rounding model are treated as external inputs whose interaction is analyzed rather than presupposed, satisfying the criteria for an independent derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Oxford Uni- versity Press, Oxford, 2013
doi: 10.1093/acprof:oso/ 9780199535255.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199535255. 001.0001. Christos Boutsikas, Petros Drineas, and Ilse CF Ipsen. Small singular values can increase in lower precision.SIAM Journal on Matrix Analysis and Applications, 45(3):1518–1540,
-
[2]
Effective Quantization of Muon Optimizer States.arXiv preprint arXiv:2509.23106,
Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi. Effective Quantization of Muon Optimizer States.arXiv preprint arXiv:2509.23106,
-
[3]
A comprehensive evaluation of quantization strategies for large language models
Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, and Deyi Xiong. A comprehensive evaluation of quantization strategies for large language models. InFindings of the association for computational linguistics: ACL 2024, pages 12186–12215,
2024
-
[4]
A Study of BFLOAT16 for Deep Learning Training
Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja V ooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of BFLOAT16 for deep learning training.arXiv preprint arXiv:1905.12322,
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[5]
LOTION: Smoothing the Optimization Landscape for Quantized Training
Mujin Kwun, Depen Morwani, Huangyuan Su, Stephanie Gil, Nikhil Anand, and Sham M Kakade. LOTION: Smoothing the Optimization Landscape for Quantized Training. InOPT 2025: Opti- mization for Machine Learning. Michael Mahoney and Charles Martin. Traditional and heavy tailed self regularization in neural network models. InInternational Conference on Machine L...
2025
-
[6]
Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenth- waite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning.arXiv preprint arXiv:2209.05433,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Recipes for Pre-training LLMs with MXFP8.arXiv preprint arXiv:2506.08027,
Asit Mishra, Dusan Stosic, Simon Layton, and Paulius Micikevicius. Recipes for Pre-training LLMs with MXFP8.arXiv preprint arXiv:2506.08027,
-
[8]
URL https: //arxiv.org/abs/2407.05230. Phuc Tran and Van Vu. New perturbation bounds for low rank approximation of matrices via contour analysis.arXiv preprint arXiv:2511.08875,
-
[9]
Eigenvalue stability and new perturbation bounds for the extremal eigenvalues of a matrix
Phuc Tran and Van Vu. Davis–Kahan theorem under a moderate gap condition.Communications in Contemporary Mathematics, 28(01):2550035, 2026a. 12 Phuc Tran and Van Vu. Eigenvalue stability and new perturbation bounds for the extremal eigenvalues of a matrix, 2026b. URLhttps://arxiv.org/abs/2603.19758. Phuc Tran and Van Vu. New matrix perturbation bounds with...
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Albert Tseng, Tao Yu, and Youngsuk Park. Training LLMs with MXFP4.arXiv preprint arXiv:2502.20586,
-
[11]
Stochastic Rounding 2.0
13 A Appendix A.1 Prior Work Stochastic rounding (SR) is a probabilistic approach to rounding that has proven effective in large- scale computations and low-precision arithmetic. Despite its illustrious beginnings in the 1950s [von Neumann and Goldstine, 1947, Green, 1950, Forsythe, 1959], SR has been largely overlooked by the numerical analysis community...
1947
-
[12]
Σ1 0 0 Σ 2 0 0 # ,E=U
themselves noted that n >10 50 is required just to drop this factor below1/2. Deterministic bounds for small singular value clusters.Motivated by the empirical observation that downcasting a matrix to lower arithmetic precision tends to lift its smallest singular values, Boutsikas et al. [2024, Theorem 3.5] obtained a deterministic lower bound for acluste...
2024
-
[13]
7Up to lower-order terms
ProofProperties (23) and (25) are established in Lemma C.1 of [Saha et al., 2023]; property (24) is immediate from the definition; property (26) is proven in Appendix A.3. 7Up to lower-order terms. 15 A.3 Proof of the Sub-Gaussian Property of the Quantization Error Here we prove property (26): the quantization error QR,B(x)−x is sub-gaussian with ∥QR,B(x)...
2023
-
[14]
Now y⊤Yx=y ⊤U⊤Ex=u ⊤Ex= dX j=1 xjξj. Applying the same moment-generating-function computation, now to the independent variables ξj withPd j=1 x2 j = 1, we obtain E exp(λy⊤Yx) = dY j=1 E[exp(λx jξj)]≤exp λ2ρ2 2 dX j=1 x2 j = exp λ2ρ2 2 . By Markov’s inequality, P y⊤Yx> t =P exp λy⊤Yx >exp (λt) ≤ E exp(λy⊤Yx) exp (λt) ≤exp λ2ρ2 2 −λt . (31) Optimizi...
2026
-
[15]
Hence E " dX i=1 σi( ˜A)2 # = tr(A⊤A) +E tr(E⊤E) = dX i=1 σi(A)2 +E ∥E∥2 F . A.8 Proof of Lemma 4.1 Proof Step 1: Top- k energy of A as a contour integral.Applied to G, the trace contour-integral identity (10) gives kX i=1 σi(A)2 = 1 2πi I Γ ztr (R G(z))dz.(35) Step 2:Γalso separates the top-keigenvalues of ˜G.By Weyl’s inequality, λi( ˜G)−λ i(G) ≤ ∥∆∥2 <...
1949
-
[16]
Then v⊤ i E⊤Evj =x ⊤Bijx,B ij := 1 2 (viv⊤ j )⊗I n + (vjv⊤ i )⊗I n
Quadratic term.Writex= vec(E)∈R nd. Then v⊤ i E⊤Evj =x ⊤Bijx,B ij := 1 2 (viv⊤ j )⊗I n + (vjv⊤ i )⊗I n . Indeed, by the cyclic property of trace: v⊤ i E⊤Evj = tr(E⊤Evjv⊤ i ).(63) Recall [Petersen and Pedersen, 2008, Eqs. (520)–(521)], vec(AXB) = (B⊤ ⊗A) vec(X),(64) tr(A⊤B) = vec(A)⊤ vec(B),(65) First, use (65) withA=E,B=Ev jv⊤ i : tr(E⊤Evjv⊤ i ) = vec(E)⊤...
2008
-
[17]
= 3(a 2 +b 2 +c 2). 22 Second, use (64) withA=I n,X=E,B=v jv⊤ i : vec(Evjv⊤ i ) = (viv⊤ j )⊗I n vec(E).(67) Substituting the above two results, we get: v⊤ i E⊤Evj = vec(E)⊤ (viv⊤ j )⊗I n vec(E).(68) Finally, since a quadratic form depends only on the symmetric part of its matrix: v⊤ i E⊤Evj = 1 2 vec(E)⊤ (viv⊤ j )⊗I n + (viv⊤ j )⊗I n ⊤ vec(E)(69) = 1 2 ve...
2026
-
[18]
24 Step 1: Approximation.Fix ε= 1/4
for a review. 24 Step 1: Approximation.Fix ε= 1/4 . By [Vershynin, 2026, Corollary 4.2.11], there exist ε-nets N ⊂S n−1 andD ⊂S d−1 with bounded cardinality: |N | ≤9 n,|D| ≤9 d. By Lemma A.1,∥E∥ 2 can be bounded using the nets as ∥E∥2 ≤2 max x∈N,y∈D |x⊤Ey|.(90) Step 2: Concentration.Fixx∈ Nandy∈ D. Then x⊤Ey= nX i=1 dX j=1 xiyjEij, which is a sum of indep...
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.