Stochastic Rounding Increases Small Singular Values

Linkai Ma; Petros Drineas; Tingzhou Yu

arxiv: 2606.00312 · v1 · pith:ABKY7L4Bnew · submitted 2026-05-29 · 🧮 math.NA · cs.LG· cs.NA

Stochastic Rounding Increases Small Singular Values

Linkai Ma , Tingzhou Yu , Petros Drineas This is my paper

Pith reviewed 2026-06-28 21:00 UTC · model grok-4.3

classification 🧮 math.NA cs.LGcs.NA

keywords stochastic roundingsingular valuesspectral regularizationlow-precision arithmeticmatrix aspect ratiorandom matricesnumerical stability

0 comments

The pith

Stochastic rounding increases clusters of small singular values even for matrices with constant aspect ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that stochastic rounding acts as a spectral regularizer whose effects are not limited to extremely tall-and-thin or short-and-fat matrices. Instead the regularization persists when the matrix dimensions stay in fixed proportion. It further shows that the effect raises entire groups of small singular values near the bottom of the spectrum rather than acting only on the single smallest one. A reader would care because this broadens the settings in which low-precision arithmetic can be expected to improve numerical stability without requiring specially shaped inputs.

Core claim

Stochastic rounding increases the smallest singular value of extremely tall-and-thin matrices, but this regularization effect holds for matrices with constant aspect ratio and moreover lifts entire clusters of singular values at the tail of the spectrum rather than only the minimal value.

What carries the argument

Stochastic rounding applied to matrix entries, which perturbs the singular value spectrum by lifting the tail cluster.

If this is right

The regularization benefit of stochastic rounding applies to the wider class of matrices with bounded aspect ratio that arise in most applications.
Stochastic rounding modifies not only the extremal singular value but a whole tail segment of the spectrum.
Low-precision computations can therefore be expected to gain stability across a larger set of problem dimensions.
The spectral effect supplies a uniform explanation for observed regularization in both extreme and moderate aspect-ratio regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rounding mechanism may improve conditioning in iterative solvers that operate on square or mildly rectangular matrices.
Similar tail-lifting behavior could appear under other quantization schemes once the aspect-ratio restriction is removed.
Numerical experiments on fixed-aspect-ratio test matrices at varying bit widths would directly test the cluster-lifting prediction.
If the effect scales with matrix size, stochastic rounding might serve as a built-in preconditioner for constant-aspect problems without extra computation.

Load-bearing premise

The stochastic rounding error model together with the random matrix ensembles used permits an analytic proof of the lifting effect for constant aspect ratios.

What would settle it

Generate a random matrix with fixed aspect ratio such as 2-to-1, apply stochastic rounding to its entries at a chosen precision, recompute its singular values, and check whether the lower cluster rises relative to the unrounded matrix.

Figures

Figures reproduced from arXiv: 2606.00312 by Linkai Ma, Petros Drineas, Tingzhou Yu.

**Figure 1.** Figure 1: An illustration of the contour Γ. Lemma 4.1 (Top-k energy expansion) Suppose ∥∆∥2 < g/2. Then E "X k i=1 σi(A˜ ) 2 # − X k i=1 σi(A) 2 = X∞ ℓ=1 Iℓ, Iℓ := 1 2πi I Γ z E h tr (RG(z)∆) ℓ RG(z) i dz. (17) The next subsection integrates and bounds the leading terms I1, I2. For the higher-order tail P ℓ≥3 Iℓ, we provide an upper bound via a contour-integral argument. We collect five lemmas: a closed-form comp… view at source ↗

read the original abstract

Over the past half-dozen years, stochastic rounding (SR) has regained significant attention as a quantization scheme for low-precision floating-point arithmetic, with applications spanning numerical analysis and modern machine learning systems. Recent work has shown that SR acts as an implicit regularizer by increasing the smallest singular value of extremely tall-and-thin (or, symmetrically, short-and-fat) matrices. In this work, we substantially sharpen and extend this understanding in two directions. First, we show that the regularization effect of SR is not restricted to extreme aspect ratio regimes: it persists for matrices with constant aspect ratio. Second, we demonstrate that SR does not merely regularize the smallest singular value, but instead lifts entire clusters of singular values at the tail of the spectrum. Together, these results provide a more general characterization of stochastic rounding as a spectral regularizer, revealing that its effects extend beyond extremal aspect ratios and act on a broader portion of the singular value spectrum.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims SR lifts clusters of small singular values even at constant aspect ratio, but that extension looks like it may not be fully supported without checking the bulk-edge analysis.

read the letter

Colleague,

The two extensions are the main thing here: stochastic rounding regularizes clusters of small singular values rather than just the smallest one, and the effect holds for matrices with fixed aspect ratio instead of only the extreme tall-thin or short-fat cases from earlier papers.

What the work does is take the prior observation about SR as an implicit regularizer and try to make it apply more broadly. The abstract frames this as a more general spectral view, which is a reasonable sharpening if the math goes through. It also ties the idea to low-precision arithmetic in both numerics and ML, which is a natural audience.

The soft spot is the constant-aspect-ratio claim. When m/n is fixed away from zero or infinity, the singular values have a bulk density from the Marchenko-Pastur law or its analogue. Lifting an entire tail cluster requires showing that the SR perturbation (additive noise with variance set by machine epsilon) produces a definite upward shift on the lower edge of that density. If the analysis only controls the extreme singular value or only works after taking further limits, the fixed-ratio statement does not follow. The abstract states the results but gives no derivations, so it is not possible to tell whether the model and ensemble actually deliver the claimed lift or whether an extra asymptotic assumption is doing the work.

This is for people who study quantization effects on matrix spectra or who use low-precision arithmetic in linear algebra routines. A reader who wants to see how implicit regularization generalizes beyond the extreme-ratio regime would get something from it, provided the proofs close the gap on the bulk edge.

It deserves peer review. The claims are specific enough that referees can check the derivations directly, and the topic is relevant enough that the time is worth spending even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that stochastic rounding (SR) acts as an implicit spectral regularizer whose effects are not limited to extreme aspect-ratio regimes: the regularization persists for matrices with fixed constant aspect ratio m/n = c. It further claims that SR lifts entire clusters of small singular values at the tail of the spectrum rather than only the smallest singular value.

Significance. If the central claims hold, the work supplies a broader analytic characterization of SR as a regularizer, with direct relevance to stability questions in low-precision linear algebra and machine-learning training. The absence of free parameters or ad-hoc fitted quantities in the derivations is a strength; the extension beyond the extreme-aspect-ratio setting, if rigorously established, would be a genuine advance over prior results.

major comments (2)

[proof of the constant-aspect-ratio result (likely §3 or Theorem 2)] The load-bearing claim that the regularization effect persists for constant aspect ratio (away from 0 and infinity) requires showing that the additive SR perturbation produces a definite positive shift on the lower edge of the Marchenko-Pastur (or non-Gaussian analogue) bulk density. The manuscript must explicitly verify that the analysis does not reduce to an additional asymptotic regime in which m/n → 0 or ∞; otherwise the constant-ratio statement does not follow.
[statement and proof of the cluster-lifting result (likely §4)] The second claim—that SR lifts an entire cluster of tail singular values rather than only the minimal one—needs a quantitative statement of the cluster size or the spectral interval that is shifted. Without an explicit bound on the number or location of affected singular values, the distinction from the “smallest singular value only” result remains unclear.

minor comments (2)

[§2] Notation for the stochastic rounding model (additive perturbation variance tied to machine epsilon) should be introduced once and used consistently; the current presentation mixes several equivalent but non-identical formulations.
[Figures 2–4] Figure captions should state the matrix ensemble, aspect ratio, and precision explicitly so that the plotted singular-value histograms can be reproduced without consulting the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The two major comments identify places where additional explicitness will strengthen the presentation. We address each point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [proof of the constant-aspect-ratio result (likely §3 or Theorem 2)] The load-bearing claim that the regularization effect persists for constant aspect ratio (away from 0 and infinity) requires showing that the additive SR perturbation produces a definite positive shift on the lower edge of the Marchenko-Pastur (or non-Gaussian analogue) bulk density. The manuscript must explicitly verify that the analysis does not reduce to an additional asymptotic regime in which m/n → 0 or ∞; otherwise the constant-ratio statement does not follow.

Authors: Theorem 2 is stated and proved for a fixed aspect ratio γ = m/n ∈ (0,∞) bounded away from the extremes; the Marchenko-Pastur (or non-Gaussian) density is parameterized by this fixed γ, and the first-order perturbation calculation yields a strictly positive shift at the lower edge that depends continuously on γ but does not vanish for any such fixed γ. The derivation never invokes an auxiliary limit γ → 0 or γ → ∞. We will insert an explicit sentence in the theorem statement and the surrounding discussion confirming that γ remains fixed throughout the argument. revision: yes
Referee: [statement and proof of the cluster-lifting result (likely §4)] The second claim—that SR lifts an entire cluster of tail singular values rather than only the minimal one—needs a quantitative statement of the cluster size or the spectral interval that is shifted. Without an explicit bound on the number or location of affected singular values, the distinction from the “smallest singular value only” result remains unclear.

Authors: The theorem in §4 already supplies explicit constants ε and δ (depending on the SR variance parameter and the matrix dimensions) such that every singular value lying in [0,ε] is increased by at least δ. The number of affected singular values is therefore at most the multiplicity of the interval [0,ε], which is controlled by the same constants. We will revise the theorem statement to display these bounds prominently and add a short remark contrasting the result with the single-smallest-singular-value case treated in prior work. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation chain self-contained against external benchmarks

full rationale

The provided abstract and context present the core claims as analytical extensions of prior regularization results to constant aspect ratios and tail clusters of singular values, without any exhibited equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the new statements to the inputs by construction. No self-definitional steps, ansatz smuggling, or uniqueness theorems imported from overlapping authors are visible in the text. The Marchenko-Pastur reference and stochastic rounding model are treated as external inputs whose interaction is analyzed rather than presupposed, satisfying the criteria for an independent derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no information on free parameters, axioms, or invented entities is available.

pith-pipeline@v0.9.1-grok · 5690 in / 983 out tokens · 23448 ms · 2026-06-28T21:00:09.202216+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 8 canonical work pages · 3 internal anchors

[1]

Oxford Uni- versity Press, Oxford, 2013

doi: 10.1093/acprof:oso/ 9780199535255.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199535255. 001.0001. Christos Boutsikas, Petros Drineas, and Ilse CF Ipsen. Small singular values can increase in lower precision.SIAM Journal on Matrix Analysis and Applications, 45(3):1518–1540,

work page doi:10.1093/acprof:oso/
[2]

Effective Quantization of Muon Optimizer States.arXiv preprint arXiv:2509.23106,

Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi. Effective Quantization of Muon Optimizer States.arXiv preprint arXiv:2509.23106,

work page arXiv
[3]

A comprehensive evaluation of quantization strategies for large language models

Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, and Deyi Xiong. A comprehensive evaluation of quantization strategies for large language models. InFindings of the association for computational linguistics: ACL 2024, pages 12186–12215,

2024
[4]

A Study of BFLOAT16 for Deep Learning Training

Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja V ooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of BFLOAT16 for deep learning training.arXiv preprint arXiv:1905.12322,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[5]

LOTION: Smoothing the Optimization Landscape for Quantized Training

Mujin Kwun, Depen Morwani, Huangyuan Su, Stephanie Gil, Nikhil Anand, and Sham M Kakade. LOTION: Smoothing the Optimization Landscape for Quantized Training. InOPT 2025: Opti- mization for Machine Learning. Michael Mahoney and Charles Martin. Traditional and heavy tailed self regularization in neural network models. InInternational Conference on Machine L...

2025
[6]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenth- waite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning.arXiv preprint arXiv:2209.05433,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Recipes for Pre-training LLMs with MXFP8.arXiv preprint arXiv:2506.08027,

Asit Mishra, Dusan Stosic, Simon Layton, and Paulius Micikevicius. Recipes for Pre-training LLMs with MXFP8.arXiv preprint arXiv:2506.08027,

work page arXiv
[8]

Phuc Tran and Van Vu

URL https: //arxiv.org/abs/2407.05230. Phuc Tran and Van Vu. New perturbation bounds for low rank approximation of matrices via contour analysis.arXiv preprint arXiv:2511.08875,

work page arXiv
[9]

Eigenvalue stability and new perturbation bounds for the extremal eigenvalues of a matrix

Phuc Tran and Van Vu. Davis–Kahan theorem under a moderate gap condition.Communications in Contemporary Mathematics, 28(01):2550035, 2026a. 12 Phuc Tran and Van Vu. Eigenvalue stability and new perturbation bounds for the extremal eigenvalues of a matrix, 2026b. URLhttps://arxiv.org/abs/2603.19758. Phuc Tran and Van Vu. New matrix perturbation bounds with...

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Alexandre B

Albert Tseng, Tao Yu, and Youngsuk Park. Training LLMs with MXFP4.arXiv preprint arXiv:2502.20586,

work page arXiv
[11]

Stochastic Rounding 2.0

13 A Appendix A.1 Prior Work Stochastic rounding (SR) is a probabilistic approach to rounding that has proven effective in large- scale computations and low-precision arithmetic. Despite its illustrious beginnings in the 1950s [von Neumann and Goldstine, 1947, Green, 1950, Forsythe, 1959], SR has been largely overlooked by the numerical analysis community...

1947
[12]

Σ1 0 0 Σ 2 0 0 # ,E=U

themselves noted that n >10 50 is required just to drop this factor below1/2. Deterministic bounds for small singular value clusters.Motivated by the empirical observation that downcasting a matrix to lower arithmetic precision tends to lift its smallest singular values, Boutsikas et al. [2024, Theorem 3.5] obtained a deterministic lower bound for acluste...

2024
[13]

7Up to lower-order terms

ProofProperties (23) and (25) are established in Lemma C.1 of [Saha et al., 2023]; property (24) is immediate from the definition; property (26) is proven in Appendix A.3. 7Up to lower-order terms. 15 A.3 Proof of the Sub-Gaussian Property of the Quantization Error Here we prove property (26): the quantization error QR,B(x)−x is sub-gaussian with ∥QR,B(x)...

2023
[14]

Now y⊤Yx=y ⊤U⊤Ex=u ⊤Ex= dX j=1 xjξj. Applying the same moment-generating-function computation, now to the independent variables ξj withPd j=1 x2 j = 1, we obtain E exp(λy⊤Yx) = dY j=1 E[exp(λx jξj)]≤exp   λ2ρ2 2 dX j=1 x2 j   = exp λ2ρ2 2 . By Markov’s inequality, P y⊤Yx> t =P exp λy⊤Yx >exp (λt) ≤ E exp(λy⊤Yx) exp (λt) ≤exp λ2ρ2 2 −λt . (31) Optimizi...

2026
[15]

Hence E " dX i=1 σi( ˜A)2 # = tr(A⊤A) +E tr(E⊤E) = dX i=1 σi(A)2 +E ∥E∥2 F . A.8 Proof of Lemma 4.1 Proof Step 1: Top- k energy of A as a contour integral.Applied to G, the trace contour-integral identity (10) gives kX i=1 σi(A)2 = 1 2πi I Γ ztr (R G(z))dz.(35) Step 2:Γalso separates the top-keigenvalues of ˜G.By Weyl’s inequality, λi( ˜G)−λ i(G) ≤ ∥∆∥2 <...

1949
[16]

Then v⊤ i E⊤Evj =x ⊤Bijx,B ij := 1 2 (viv⊤ j )⊗I n + (vjv⊤ i )⊗I n

Quadratic term.Writex= vec(E)∈R nd. Then v⊤ i E⊤Evj =x ⊤Bijx,B ij := 1 2 (viv⊤ j )⊗I n + (vjv⊤ i )⊗I n . Indeed, by the cyclic property of trace: v⊤ i E⊤Evj = tr(E⊤Evjv⊤ i ).(63) Recall [Petersen and Pedersen, 2008, Eqs. (520)–(521)], vec(AXB) = (B⊤ ⊗A) vec(X),(64) tr(A⊤B) = vec(A)⊤ vec(B),(65) First, use (65) withA=E,B=Ev jv⊤ i : tr(E⊤Evjv⊤ i ) = vec(E)⊤...

2008
[17]

= 3(a 2 +b 2 +c 2). 22 Second, use (64) withA=I n,X=E,B=v jv⊤ i : vec(Evjv⊤ i ) = (viv⊤ j )⊗I n vec(E).(67) Substituting the above two results, we get: v⊤ i E⊤Evj = vec(E)⊤ (viv⊤ j )⊗I n vec(E).(68) Finally, since a quadratic form depends only on the symmetric part of its matrix: v⊤ i E⊤Evj = 1 2 vec(E)⊤ (viv⊤ j )⊗I n + (viv⊤ j )⊗I n ⊤ vec(E)(69) = 1 2 ve...

2026
[18]

24 Step 1: Approximation.Fix ε= 1/4

for a review. 24 Step 1: Approximation.Fix ε= 1/4 . By [Vershynin, 2026, Corollary 4.2.11], there exist ε-nets N ⊂S n−1 andD ⊂S d−1 with bounded cardinality: |N | ≤9 n,|D| ≤9 d. By Lemma A.1,∥E∥ 2 can be bounded using the nets as ∥E∥2 ≤2 max x∈N,y∈D |x⊤Ey|.(90) Step 2: Concentration.Fixx∈ Nandy∈ D. Then x⊤Ey= nX i=1 dX j=1 xiyjEij, which is a sum of indep...

2026

[1] [1]

Oxford Uni- versity Press, Oxford, 2013

doi: 10.1093/acprof:oso/ 9780199535255.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199535255. 001.0001. Christos Boutsikas, Petros Drineas, and Ilse CF Ipsen. Small singular values can increase in lower precision.SIAM Journal on Matrix Analysis and Applications, 45(3):1518–1540,

work page doi:10.1093/acprof:oso/

[2] [2]

Effective Quantization of Muon Optimizer States.arXiv preprint arXiv:2509.23106,

Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi. Effective Quantization of Muon Optimizer States.arXiv preprint arXiv:2509.23106,

work page arXiv

[3] [3]

A comprehensive evaluation of quantization strategies for large language models

Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, and Deyi Xiong. A comprehensive evaluation of quantization strategies for large language models. InFindings of the association for computational linguistics: ACL 2024, pages 12186–12215,

2024

[4] [4]

A Study of BFLOAT16 for Deep Learning Training

Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja V ooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of BFLOAT16 for deep learning training.arXiv preprint arXiv:1905.12322,

work page internal anchor Pith review Pith/arXiv arXiv 1905

[5] [5]

LOTION: Smoothing the Optimization Landscape for Quantized Training

Mujin Kwun, Depen Morwani, Huangyuan Su, Stephanie Gil, Nikhil Anand, and Sham M Kakade. LOTION: Smoothing the Optimization Landscape for Quantized Training. InOPT 2025: Opti- mization for Machine Learning. Michael Mahoney and Charles Martin. Traditional and heavy tailed self regularization in neural network models. InInternational Conference on Machine L...

2025

[6] [6]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenth- waite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning.arXiv preprint arXiv:2209.05433,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Recipes for Pre-training LLMs with MXFP8.arXiv preprint arXiv:2506.08027,

Asit Mishra, Dusan Stosic, Simon Layton, and Paulius Micikevicius. Recipes for Pre-training LLMs with MXFP8.arXiv preprint arXiv:2506.08027,

work page arXiv

[8] [8]

Phuc Tran and Van Vu

URL https: //arxiv.org/abs/2407.05230. Phuc Tran and Van Vu. New perturbation bounds for low rank approximation of matrices via contour analysis.arXiv preprint arXiv:2511.08875,

work page arXiv

[9] [9]

Eigenvalue stability and new perturbation bounds for the extremal eigenvalues of a matrix

Phuc Tran and Van Vu. Davis–Kahan theorem under a moderate gap condition.Communications in Contemporary Mathematics, 28(01):2550035, 2026a. 12 Phuc Tran and Van Vu. Eigenvalue stability and new perturbation bounds for the extremal eigenvalues of a matrix, 2026b. URLhttps://arxiv.org/abs/2603.19758. Phuc Tran and Van Vu. New matrix perturbation bounds with...

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Alexandre B

Albert Tseng, Tao Yu, and Youngsuk Park. Training LLMs with MXFP4.arXiv preprint arXiv:2502.20586,

work page arXiv

[11] [11]

Stochastic Rounding 2.0

13 A Appendix A.1 Prior Work Stochastic rounding (SR) is a probabilistic approach to rounding that has proven effective in large- scale computations and low-precision arithmetic. Despite its illustrious beginnings in the 1950s [von Neumann and Goldstine, 1947, Green, 1950, Forsythe, 1959], SR has been largely overlooked by the numerical analysis community...

1947

[12] [12]

Σ1 0 0 Σ 2 0 0 # ,E=U

themselves noted that n >10 50 is required just to drop this factor below1/2. Deterministic bounds for small singular value clusters.Motivated by the empirical observation that downcasting a matrix to lower arithmetic precision tends to lift its smallest singular values, Boutsikas et al. [2024, Theorem 3.5] obtained a deterministic lower bound for acluste...

2024

[13] [13]

7Up to lower-order terms

ProofProperties (23) and (25) are established in Lemma C.1 of [Saha et al., 2023]; property (24) is immediate from the definition; property (26) is proven in Appendix A.3. 7Up to lower-order terms. 15 A.3 Proof of the Sub-Gaussian Property of the Quantization Error Here we prove property (26): the quantization error QR,B(x)−x is sub-gaussian with ∥QR,B(x)...

2023

[14] [14]

Now y⊤Yx=y ⊤U⊤Ex=u ⊤Ex= dX j=1 xjξj. Applying the same moment-generating-function computation, now to the independent variables ξj withPd j=1 x2 j = 1, we obtain E exp(λy⊤Yx) = dY j=1 E[exp(λx jξj)]≤exp   λ2ρ2 2 dX j=1 x2 j   = exp λ2ρ2 2 . By Markov’s inequality, P y⊤Yx> t =P exp λy⊤Yx >exp (λt) ≤ E exp(λy⊤Yx) exp (λt) ≤exp λ2ρ2 2 −λt . (31) Optimizi...

2026

[15] [15]

Hence E " dX i=1 σi( ˜A)2 # = tr(A⊤A) +E tr(E⊤E) = dX i=1 σi(A)2 +E ∥E∥2 F . A.8 Proof of Lemma 4.1 Proof Step 1: Top- k energy of A as a contour integral.Applied to G, the trace contour-integral identity (10) gives kX i=1 σi(A)2 = 1 2πi I Γ ztr (R G(z))dz.(35) Step 2:Γalso separates the top-keigenvalues of ˜G.By Weyl’s inequality, λi( ˜G)−λ i(G) ≤ ∥∆∥2 <...

1949

[16] [16]

Then v⊤ i E⊤Evj =x ⊤Bijx,B ij := 1 2 (viv⊤ j )⊗I n + (vjv⊤ i )⊗I n

Quadratic term.Writex= vec(E)∈R nd. Then v⊤ i E⊤Evj =x ⊤Bijx,B ij := 1 2 (viv⊤ j )⊗I n + (vjv⊤ i )⊗I n . Indeed, by the cyclic property of trace: v⊤ i E⊤Evj = tr(E⊤Evjv⊤ i ).(63) Recall [Petersen and Pedersen, 2008, Eqs. (520)–(521)], vec(AXB) = (B⊤ ⊗A) vec(X),(64) tr(A⊤B) = vec(A)⊤ vec(B),(65) First, use (65) withA=E,B=Ev jv⊤ i : tr(E⊤Evjv⊤ i ) = vec(E)⊤...

2008

[17] [17]

= 3(a 2 +b 2 +c 2). 22 Second, use (64) withA=I n,X=E,B=v jv⊤ i : vec(Evjv⊤ i ) = (viv⊤ j )⊗I n vec(E).(67) Substituting the above two results, we get: v⊤ i E⊤Evj = vec(E)⊤ (viv⊤ j )⊗I n vec(E).(68) Finally, since a quadratic form depends only on the symmetric part of its matrix: v⊤ i E⊤Evj = 1 2 vec(E)⊤ (viv⊤ j )⊗I n + (viv⊤ j )⊗I n ⊤ vec(E)(69) = 1 2 ve...

2026

[18] [18]

24 Step 1: Approximation.Fix ε= 1/4

for a review. 24 Step 1: Approximation.Fix ε= 1/4 . By [Vershynin, 2026, Corollary 4.2.11], there exist ε-nets N ⊂S n−1 andD ⊂S d−1 with bounded cardinality: |N | ≤9 n,|D| ≤9 d. By Lemma A.1,∥E∥ 2 can be bounded using the nets as ∥E∥2 ≤2 max x∈N,y∈D |x⊤Ey|.(90) Step 2: Concentration.Fixx∈ Nandy∈ D. Then x⊤Ey= nX i=1 dX j=1 xiyjEij, which is a sum of indep...

2026