Finite-Sample Performance of Gradient Descent in Logistic Regression with Gaussian Design

Arya Mazumdar; Junren Chen

arxiv: 2606.21683 · v1 · pith:ECMTFQ2Tnew · submitted 2026-06-19 · 📊 stat.ML · cs.IT· cs.LG· math.IT

Finite-Sample Performance of Gradient Descent in Logistic Regression with Gaussian Design

Junren Chen , Arya Mazumdar This is my paper

Pith reviewed 2026-06-26 12:27 UTC · model grok-4.3

classification 📊 stat.ML cs.ITcs.LGmath.IT

keywords logistic regressiongradient descentfinite-sample analysisGaussian designconvergence ratesparameter estimation

0 comments

The pith

Gradient descent on logistic regression reaches an l2 error of order sqrt of theta-star norm to the fifth times d over n with linear convergence under small stepsize.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies gradient descent applied to the logistic loss for estimating a fixed unknown parameter theta-star from n i.i.d. Gaussian design samples in logistic regression. It proves that with a small constant stepsize and zero initialization the iterates converge linearly to a neighborhood of theta-star whose radius scales as the square root of (theta-star norm to the fifth power times dimension over samples). This supplies explicit non-asymptotic finite-sample rates that improve on prior results lacking such error bounds or showing slower convergence. The proof rests on verifying an approximate invertibility condition for the gradient via uniform deviation bounds and eigenvalue analysis of the population Hessian.

Core claim

Under small O(1) stepsize and zero initialization, gradient descent linearly converges to a small neighborhood of theta* achieving an l2 error of order O(sqrt(||theta*||_2^5 d/n)). A faster local linear convergence to the same statistical error holds under a large Theta(||theta*||) stepsize. The gradient of the logistic loss satisfies an approximate invertibility condition that enables the contraction mapping for both empirical and population iterates.

What carries the argument

Approximate invertibility condition (AIC) on the gradient of the logistic loss, established by uniform control of empirical-population deviations via covering and peeling, followed by contraction of population GD iterates via eigenvalues of population Hessian matrices.

If this is right

Gradient descent attains the stated non-asymptotic l2 estimation rate under the given initialization and stepsize.
Local linear convergence accelerates when the stepsize is scaled up to Theta of the norm of theta-star.
A novel efficient estimator constructed from recent work reaches the sharper rate Theta of sqrt of (norm of theta-star times d over n), showing prior non-asymptotic bounds have suboptimal dependence on the signal strength.
In many regimes the tight statistical error rate is therefore Theta of sqrt of (norm of theta-star times d over n).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The improved rate achieved by the novel estimator suggests that a refined stepsize schedule or analysis might allow plain gradient descent to reach the same dependence on theta-star norm.
The approximate invertibility technique may extend to other generalized linear models whose link functions produce comparable Hessian eigenvalue behavior.
The guarantees are most relevant in regimes where dimension d grows with sample size n but the signal strength remains moderate.

Load-bearing premise

The gradient of the logistic loss satisfies an approximate invertibility condition that supports contraction mapping for the iterates.

What would settle it

An explicit counter-example or simulation in which the empirical gradient deviates from its population counterpart enough to violate the approximate invertibility condition, causing GD iterates to fail to contract linearly into the claimed neighborhood, would falsify the result.

Figures

Figures reproduced from arXiv: 2606.21683 by Arya Mazumdar, Junren Chen.

**Figure 2.** Figure 2: Comparing Algorithm 2 with small stepsize GD. 46 [PITH_FULL_IMAGE:figures/full_fig_p046_2.png] view at source ↗

read the original abstract

We consider the parameter estimation problem in logistic regression with Gaussian design: the estimation of a fixed unknown parameter $\theta^*\in \mathbb{R}^d$ ($\|\theta^*\|_2\ge 1$) from $n$ i.i.d. samples $\{(x_i,y_i)\}_{i=1}^n$, where $x_i\sim N(0,I_d)$ and $y_i|x_i \sim {\rm Bernoulli}(1/(1+\exp(-x_i^\top \theta^*)))$. Our main aim is to characterize the finite-sample estimation performance and convergence behavior of gradient descent (GD) on the maximum likelihood objective (i.e., the logistic loss). Under small $O(1)$ stepsize and $0$ initialization, we show that GD linearly converges to a small neighborhood of $\theta^*$ achieving an $\ell_2$ error of order $O(\sqrt{\|\theta^*\|_2^5d/n})$. This substantially goes beyond existing theoretical results that lack non-asymptotic estimation error rate and exhibit much slower parameter convergence. We also establish a faster local linear convergence to the same statistical error under a large $\Theta(\|\theta^*\|_2)$ stepsize. The main technical component is to show that the gradient of the logistic loss satisfies a certain approximate invertibility condition (AIC). To that end, we uniformly control the deviation of the gradient from its population counterpart by covering and peeling arguments, and then show that the population GD is a contraction by a delicate analysis based on the eigenvalues of population Hessian matrices. Finally, we build upon the recent work Matsumoto and Mazumdar (2025) and devise a novel efficient estimator that attains a sharper rate in high dimensions. This indicates that the existing non-asymptotic guarantees exhibit sub-optimal dependence on $\|\theta^*\|_2$, and that in many regimes $\Theta(\sqrt{\|\theta^*\|_2d/n})$ is the tight estimation error rate. Numerical examples are provided to corroborate our theoretical results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first explicit non-asymptotic rate for GD on logistic loss under Gaussian design, but the ||θ*||^5 term in the bound is the part that needs checking.

read the letter

The main takeaway is that this work supplies an explicit finite-sample ℓ2 error bound for gradient descent on logistic regression with Gaussian design: linear convergence from zero initialization with O(1) step size to a neighborhood of size O(√(||θ*||₂^5 d/n)). That rate, plus the local linear convergence result with larger step size, is new relative to the literature cited in the abstract.

They establish it by proving an approximate invertibility condition on the empirical gradient. The argument splits into uniform deviation control via covering and peeling, followed by eigenvalue analysis of the population Hessian to get contraction. They also use their own 2025 result to build a different estimator that reaches the tighter O(√(||θ*||₂ d/n)) rate, which is a useful secondary point showing the GD bound is not optimal.

The soft spot is the ||θ*||^5 dependence. It suggests either the deviation term or the contraction modulus worsens polynomially with signal strength, and it is not obvious that an O(1) step size keeps the contraction factor below 1 once ||θ*|| grows. The abstract assumes ||θ*|| ≥ 1 but gives no upper bound, so the regime where the result is useful remains unclear. The AIC is the load-bearing assumption and the sketch in the abstract does not let me verify it directly.

This is for readers working on non-asymptotic optimization and sample complexity for GLMs. The citation pattern is clean and the approach engages the existing results honestly. The paper deserves a serious referee so the derivations around AIC and the norm dependence can be checked in detail.

Referee Report

3 major / 2 minor

Summary. The paper analyzes finite-sample behavior of gradient descent (GD) on the logistic loss for parameter estimation in logistic regression with Gaussian design x_i ~ N(0,I_d). With small O(1) stepsize and zero initialization, GD is claimed to converge linearly to a neighborhood of θ* with ℓ2 error O(√(||θ*||_2^5 d/n)). A faster local linear convergence is shown under Θ(||θ*||_2) stepsize. The key technical step is establishing an approximate invertibility condition (AIC) on the gradient via covering/peeling for uniform deviation bounds and eigenvalue analysis of the population Hessian; a sharper estimator attaining O(√(||θ*||_2 d/n)) is constructed by building on Matsumoto and Mazumdar (2025).

Significance. If the non-asymptotic rates and AIC hold, the work supplies explicit finite-sample guarantees for GD in logistic regression that track the dependence on ||θ*||_2 (previously missing in the literature) and indicates near-optimality of the statistical radius in certain regimes. The combination of uniform deviation arguments with contraction mapping via population Hessian eigenvalues is a standard and appropriate technique for this setting.

major comments (3)

[AIC proof (covering/peeling + population contraction)] The central contraction-mapping argument rests on the approximate invertibility condition (AIC). The claimed statistical radius contains an ||θ*||_2^5 factor inside the square root; this indicates that either the uniform deviation bound or the contraction modulus in the AIC proof degrades polynomially with ||θ*||_2. It is therefore necessary to verify explicitly that an O(1) stepsize still yields a contraction factor strictly less than 1 uniformly in the regime ||θ*||_2 ≫ 1 (see the population-Hessian eigenvalue analysis).
[Population Hessian analysis] The eigenvalue analysis of the population Hessian that establishes contractivity of the population GD map must be checked for its dependence on ||θ*||_2. The logistic Hessian involves factors of the form σ'(x^T θ) x x^T whose smallest eigenvalues can deteriorate when ||θ*||_2 grows, potentially violating the uniform contraction needed for the claimed linear convergence rate.
[Local convergence statement] The faster local linear convergence result under the larger Θ(||θ*||_2) stepsize is stated to reach the same statistical radius; the local neighborhood size in which this larger stepsize is valid must be shown to contain the statistical ball of radius O(√(||θ*||_2^5 d/n)) without circularity.

minor comments (2)

[Notation section] Notation for the logistic loss and its gradient should be introduced once and used consistently; the transition between empirical and population quantities is occasionally ambiguous.
[Numerical examples] The numerical experiments section would benefit from explicit reporting of the observed dependence of the error on ||θ*||_2 to corroborate the ||θ*||_2^5 factor.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications from the existing analysis and indicating where revisions will be made.

read point-by-point responses

Referee: [AIC proof (covering/peeling + population contraction)] The central contraction-mapping argument rests on the approximate invertibility condition (AIC). The claimed statistical radius contains an ||θ*||_2^5 factor inside the square root; this indicates that either the uniform deviation bound or the contraction modulus in the AIC proof degrades polynomially with ||θ*||_2. It is therefore necessary to verify explicitly that an O(1) stepsize still yields a contraction factor strictly less than 1 uniformly in the regime ||θ*||_2 ≫ 1 (see the population-Hessian eigenvalue analysis).

Authors: The polynomial dependence on ||θ*||_2 in the statistical radius arises exclusively from the uniform deviation bounds derived via the covering and peeling arguments (which must control the tails of the logistic gradient under large signal strength). The contraction modulus itself remains strictly less than 1 with an O(1) stepsize, uniformly over ||θ*||_2 ≥ 1. This follows from the population Hessian eigenvalue analysis, which yields a contraction factor of at most 1 − c for an absolute constant c > 0 independent of ||θ*||_2. We will insert an explicit lemma stating this uniform lower bound on the contraction factor to make the separation between deviation and contraction transparent. revision: yes
Referee: [Population Hessian analysis] The eigenvalue analysis of the population Hessian that establishes contractivity of the population GD map must be checked for its dependence on ||θ*||_2. The logistic Hessian involves factors of the form σ'(x^T θ) x x^T whose smallest eigenvalues can deteriorate when ||θ*||_2 grows, potentially violating the uniform contraction needed for the claimed linear convergence rate.

Authors: Although individual realizations of σ'(x^T θ)xx^T can be small for large ||θ*||_2, the population expectation E[σ'(x^T θ*)xx^T] admits a uniform positive lower bound on its minimal eigenvalue (at least 1/8) for all ||θ*||_2 ≥ 1 under Gaussian design. This is established by reducing to a one-dimensional expectation over the marginal projection and using the fact that the logistic variance function remains sufficiently positive in expectation. The O(1) stepsize is chosen smaller than the reciprocal of the uniform upper bound on the Hessian operator norm, preserving the contraction. We will expand the eigenvalue section with the full uniform bounds and the explicit stepsize restriction. revision: yes
Referee: [Local convergence statement] The faster local linear convergence result under the larger Θ(||θ*||_2) stepsize is stated to reach the same statistical radius; the local neighborhood size in which this larger stepsize is valid must be shown to contain the statistical ball of radius O(√(||θ*||_2^5 d/n)) without circularity.

Authors: The region of validity for the Θ(||θ*||_2) stepsize is a ball of radius Ω(1/||θ*||_2) around θ*, whose size is independent of n and d. Because the statistical radius O(√(||θ*||_2^5 d/n)) tends to zero as n → ∞ (for fixed d and ||θ*||_2), there exists N0 such that for all n ≥ N0 the statistical ball lies strictly inside the local region. The local contraction is proved on the population map; the finite-sample deviation is controlled separately via the AIC and does not enter the definition of the local neighborhood. We will add an explicit comparison of the two radii in the local-convergence section to eliminate any appearance of circularity. revision: yes

Circularity Check

0 steps flagged

Minor self-citation for secondary sharper estimator; central GD convergence derived independently

full rationale

The paper's core derivation establishes linear convergence of GD to a neighborhood of size O(√(||θ*||₂⁵ d/n)) by proving the approximate invertibility condition (AIC) via uniform deviation bounds (covering/peeling) on the empirical gradient and eigenvalue analysis of the population Hessian to show contraction. This analysis is self-contained and does not reduce to any fitted input, self-definition, or prior self-citation. The only self-citation appears in the final paragraph, where Matsumoto and Mazumdar (2025) is invoked solely to construct an auxiliary efficient estimator with improved rate; this does not support or justify the main finite-sample GD claim. No other patterns (self-definitional, fitted predictions, ansatz smuggling, or renaming) are present. The result is therefore a standard first-principles analysis with one non-load-bearing self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on the standard Gaussian design assumption and introduces the AIC as a paper-specific technical condition; no data-fitted free parameters are used in the stated rates.

axioms (2)

domain assumption Covariates x_i are i.i.d. N(0, I_d)
Standard modeling assumption for the design matrix in high-dimensional logistic regression.
ad hoc to paper Gradient of logistic loss satisfies approximate invertibility condition (AIC)
Central technical device introduced to prove contraction of both empirical and population GD iterates.

pith-pipeline@v0.9.1-grok · 5902 in / 1399 out tokens · 26811 ms · 2026-06-26T12:27:11.892479+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 1 linked inside Pith

[1]

Robust uniform recovery of structured signals from nonlinear observations.arXiv preprint arXiv:2604.20075, 2026

Pedro Abdalla, Radu Balan, and Junren Chen. Robust uniform recovery of structured signals from nonlinear observations.arXiv preprint arXiv:2604.20075, 2026

Pith/arXiv arXiv 2026
[2]

Model selection and minimax estimation in generalized linear models.IEEE Transactions on Information Theory, 62(6):3721–3730, 2016

Felix Abramovich and Vadim Grinshtein. Model selection and minimax estimation in generalized linear models.IEEE Transactions on Information Theory, 62(6):3721–3730, 2016

2016
[3]

Gradient descent converges linearly for logistic regres- sion on separable data

Kyriakos Axiotis and Maxim Sviridenko. Gradient descent converges linearly for logistic regres- sion on separable data. InInternational Conference on Machine Learning, pages 1302–1319. PMLR, 2023

2023
[4]

Rademacher and gaussian complexities: Risk bounds and structural results.Journal of machine learning research, 3(Nov):463–482, 2002

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.Journal of machine learning research, 3(Nov):463–482, 2002

2002
[5]

Oxford University Press, February 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, February 2013

2013
[6]

The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.The Annals of Statistics, 48(1):27–42, 2020

Emmanuel J Candès and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.The Annals of Statistics, 48(1):27–42, 2020. 11

2020
[7]

Finite-sample performance of the maximum likelihood estimator in logistic regression.arXiv preprint arXiv:2411.02137, 2024

Hugo Chardon, Matthieu Lerasle, and Jaouad Mourtada. Finite-sample performance of the maximum likelihood estimator in logistic regression.arXiv preprint arXiv:2411.02137, 2024

arXiv 2024
[8]

A unified approach to statistical estima- tion under nonlinear observations: tensor estimation and low-rank factorization.arXiv preprint arXiv:2510.16965, 2025

Junren Chen, Lijun Ding, Dong Xia, and Ming Yuan. A unified approach to statistical estima- tion under nonlinear observations: tensor estimation and low-rank factorization.arXiv preprint arXiv:2510.16965, 2025

arXiv 2025
[9]

One-bit phase retrieval: Optimal rates and efficient algorithms

Junren Chen and Ming Yuan. One-bit phase retrieval: Optimal rates and efficient algorithms. IEEE Transactions on Information Theory, 2026

2026
[10]

Gradient descent on neural networks typically occurs at the edge of stability

Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. InInternational Conference on Learning Representations, 2021

2021
[11]

Nbiht: An efficient algorithm for 1-bit compressed sensing with optimal error decay rate.IEEE Transactions on Information Theory, 68(2):1157–1177, 2021

Michael P Friedlander, Halyun Jeong, Yaniv Plan, and Özgür Yılmaz. Nbiht: An efficient algorithm for 1-bit compressed sensing with optimal error decay rate.IEEE Transactions on Information Theory, 68(2):1157–1177, 2021

2021
[12]

Academic press, 2013

Morris W Hirsch, Stephen Smale, and Robert L Devaney.Differential equations, dynamical systems, and an introduction to chaos. Academic press, 2013

2013
[13]

On the sample complexity of parameter estimation in logistic regression with normal design

Daniel Hsu and Arya Mazumdar. On the sample complexity of parameter estimation in logistic regression with normal design. InThe Thirty Seventh Annual Conference on Learning Theory, pages 2418–2437. PMLR, 2024

2024
[14]

Robust 1- bit compressive sensing via binary stable embeddings of sparse vectors.IEEE Transactions on Information Theory, 59(4):2082–2102, 2013

Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1- bit compressive sensing via binary stable embeddings of sparse vectors.IEEE Transactions on Information Theory, 59(4):2082–2102, 2013

2082
[15]

The implicit bias of gradient descent on nonseparable data

Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on learning theory, pages 1772–1798. PMLR, 2019

2019
[16]

Efficient learning of gener- alized linear and single index models with isotonic regression.Advances in Neural Information Processing Systems, 24, 2011

Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of gener- alized linear and single index models with isotonic regression.Advances in Neural Information Processing Systems, 24, 2011

2011
[17]

The isotron algorithm: High-dimensional isotonic regres- sion

Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regres- sion. InCOLT, volume 1, page 9, 2009

2009
[18]

Finite sample rates for logistic regression with small noise or few samples.Sankhya A, pages 1–70, 2024

Felix Kuchelmeister and Sara van de Geer. Finite sample rates for logistic regression with small noise or few samples.Sankhya A, pages 1–70, 2024

2024
[19]

Springer

Michel Ledoux and Michel Talagrand.Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer
[20]

On the sample complexity of pac learning half-spaces against the uniform distribution.IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995

Philip M Long. On the sample complexity of pac learning half-spaces against the uniform distribution.IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995

1995
[21]

Robust multivariate mean estimation: The optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

Gábor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: The optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021. 12

2021
[22]

Binary iterative hard thresholding converges with optimal number of measurements for 1-bit compressed sensing.Journal of the ACM, 71(5):1–64, 2024

Namiko Matsumoto and Arya Mazumdar. Binary iterative hard thresholding converges with optimal number of measurements for 1-bit compressed sensing.Journal of the ACM, 71(5):1–64, 2024

2024
[23]

Learning sparse generalized linear models with binary outcomes via iterative hard thresholding

Namiko Matsumoto and Arya Mazumdar. Learning sparse generalized linear models with binary outcomes via iterative hard thresholding. InProceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 3933–4032. PMLR, 30 Jun–04 Jul 2025

2025
[24]

Nelder.Generalized linear models

Peter McCullagh and John A. Nelder.Generalized linear models. Routledge, 2019

2019
[25]

Gradient descent on logistic regression: Do large step-sizes work with data on the sphere?arXiv preprint arXiv:2507.11228, 2025

Si Yi Meng, Baptiste Goujaud, Antonio Orvieto, and Christopher De Sa. Gradient descent on logistic regression: Do large step-sizes work with data on the sphere?arXiv preprint arXiv:2507.11228, 2025

arXiv 2025
[26]

Gradient descent on logistic regression with non-separable data and large step sizes.arXiv preprint arXiv:2406.05033, 2024

Si Yi Meng, Antonio Orvieto, Daniel Yiming Cao, and Christopher De Sa. Gradient descent on logistic regression with non-separable data and large step sizes.arXiv preprint arXiv:2406.05033, 2024

arXiv 2024
[27]

Convergence of gradient descent on separable data

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428. PMLR, 2019

2019
[28]

Springer, 2018

Yurii Nesterov et al.Lectures on convex optimization, volume 137. Springer, 2018

2018
[29]

Phase retrieval using alternating mini- mization.Advances in Neural Information Processing Systems, 26, 2013

Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating mini- mization.Advances in Neural Information Processing Systems, 26, 2013

2013
[30]

High-dimensionalestimationwithgeometric constraints.Information and Inference: A Journal of the IMA, 6(1):1–40, 2017

YanivPlan, RomanVershynin, andElenaYudovina. High-dimensionalestimationwithgeometric constraints.Information and Inference: A Journal of the IMA, 6(1):1–40, 2017

2017
[31]

Onpaclearningusingwinnow, perceptron, andaperceptron-likealgorithm

RoccoAServedio. Onpaclearningusingwinnow, perceptron, andaperceptron-likealgorithm. In Proceedings of the Twelfth Annual Conference on Computational learning theory, pages 296–307, 1999

1999
[32]

The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70):1–57, 2018

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70):1–57, 2018

2018
[33]

A bound for the error in the normal approximation to the distribution of a sum of dependent random variables

Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. InProceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 2: Probability theory, volume 6, pages 583–603. University of California Press, 1972

1972
[34]

Use of exchangeable pairs in the analysis of simulations.Lecture Notes-Monograph Series, pages 1–26, 2004

Charles Stein, Persi Diaconis, Susan Holmes, and Gesine Reinert. Use of exchangeable pairs in the analysis of simulations.Lecture Notes-Monograph Series, pages 1–26, 2004

2004
[35]

From logistic regression to the perceptron algorithm: Exploring gradient descent with large step sizes

Alexander Tyurin. From logistic regression to the perceptron algorithm: Exploring gradient descent with large step sizes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20938–20946, 2025. 13

2025
[36]

Cambridge University Press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data sci- ence, volume 47. Cambridge University Press, 2018

2018
[37]

Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency

Jingfeng Wu, Peter L Bartlett, Matus Telgarsky, and Bin Yu. Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency. InThe Thirty Seventh Annual Conference on Learning Theory, pages 5019–5073. PMLR, 2024

2024
[38]

Implicit bias of gradient descent for logistic regression at the edge of stability.Advances in Neural Information Processing Systems, 36:74229–74256, 2023

Jingfeng Wu, Vladimir Braverman, and Jason D Lee. Implicit bias of gradient descent for logistic regression at the edge of stability.Advances in Neural Information Processing Systems, 36:74229–74256, 2023

2023
[39]

Large stepsizes accelerate gradient descent for regularized logistic regression.arXiv preprint arXiv:2506.02336, 2025

Jingfeng Wu, Pierre Marion, and Peter Bartlett. Large stepsizes accelerate gradient descent for regularized logistic regression.arXiv preprint arXiv:2506.02336, 2025

arXiv 2025
[40]

Gradient descent converges arbitrarily fast for logistic regression via large and adaptive stepsizes

Ruiqi Zhang, Jingfeng Wu, and Peter Bartlett. Gradient descent converges arbitrarily fast for logistic regression via large and adaptive stepsizes. InForty-second International Conference on Machine Learning, 2025

2025
[41]

localization

Ruiqi Zhang, Jingfeng Wu, Licong Lin, and Peter L Bartlett. Minimax optimal convergence of gradient descent in logistic regression via large and adaptive stepsizes.arXiv preprint arXiv:2504.04105, 2025. A Proof of Theorem 1 (GD with Small Stepsize) The main ingredient is an approximate invertibility condition (AIC) that controls ∥u−θ ∗ −η·h θ∗(u)∥2,∀u∈B d...

arXiv 2025
[42]

C.1 Bounding| ˆλ− ∥θ ∗∥2|(norm estimation error) In order to bound|ˆλ− ∥θ ∗∥2|= q−1(ˆv⊤ˆr)− ∥θ∗∥2 ,we shall start with bounding ˆv⊤ˆr−q(∥θ∗∥2)

In fact, by assuming the bound in Lemma 1 holds (this is within the probability promised in Theorem 3), then we have ∥θ∗∥2 ˆr− θ∗ ∥θ∗∥2 2 ≲polylog(n) r ∥θ∗∥2d n + ∥θ∗∥2d n .(70) In the remainder of this appendix, we focus on bounding the norm estimation error,|ˆλ− ∥θ ∗∥2|. C.1 Bounding| ˆλ− ∥θ ∗∥2|(norm estimation error) In order to bound|ˆλ− ∥θ ∗∥2|= q−1...

[1] [1]

Robust uniform recovery of structured signals from nonlinear observations.arXiv preprint arXiv:2604.20075, 2026

Pedro Abdalla, Radu Balan, and Junren Chen. Robust uniform recovery of structured signals from nonlinear observations.arXiv preprint arXiv:2604.20075, 2026

Pith/arXiv arXiv 2026

[2] [2]

Model selection and minimax estimation in generalized linear models.IEEE Transactions on Information Theory, 62(6):3721–3730, 2016

Felix Abramovich and Vadim Grinshtein. Model selection and minimax estimation in generalized linear models.IEEE Transactions on Information Theory, 62(6):3721–3730, 2016

2016

[3] [3]

Gradient descent converges linearly for logistic regres- sion on separable data

Kyriakos Axiotis and Maxim Sviridenko. Gradient descent converges linearly for logistic regres- sion on separable data. InInternational Conference on Machine Learning, pages 1302–1319. PMLR, 2023

2023

[4] [4]

Rademacher and gaussian complexities: Risk bounds and structural results.Journal of machine learning research, 3(Nov):463–482, 2002

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.Journal of machine learning research, 3(Nov):463–482, 2002

2002

[5] [5]

Oxford University Press, February 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, February 2013

2013

[6] [6]

The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.The Annals of Statistics, 48(1):27–42, 2020

Emmanuel J Candès and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.The Annals of Statistics, 48(1):27–42, 2020. 11

2020

[7] [7]

Finite-sample performance of the maximum likelihood estimator in logistic regression.arXiv preprint arXiv:2411.02137, 2024

Hugo Chardon, Matthieu Lerasle, and Jaouad Mourtada. Finite-sample performance of the maximum likelihood estimator in logistic regression.arXiv preprint arXiv:2411.02137, 2024

arXiv 2024

[8] [8]

A unified approach to statistical estima- tion under nonlinear observations: tensor estimation and low-rank factorization.arXiv preprint arXiv:2510.16965, 2025

Junren Chen, Lijun Ding, Dong Xia, and Ming Yuan. A unified approach to statistical estima- tion under nonlinear observations: tensor estimation and low-rank factorization.arXiv preprint arXiv:2510.16965, 2025

arXiv 2025

[9] [9]

One-bit phase retrieval: Optimal rates and efficient algorithms

Junren Chen and Ming Yuan. One-bit phase retrieval: Optimal rates and efficient algorithms. IEEE Transactions on Information Theory, 2026

2026

[10] [10]

Gradient descent on neural networks typically occurs at the edge of stability

Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. InInternational Conference on Learning Representations, 2021

2021

[11] [11]

Nbiht: An efficient algorithm for 1-bit compressed sensing with optimal error decay rate.IEEE Transactions on Information Theory, 68(2):1157–1177, 2021

Michael P Friedlander, Halyun Jeong, Yaniv Plan, and Özgür Yılmaz. Nbiht: An efficient algorithm for 1-bit compressed sensing with optimal error decay rate.IEEE Transactions on Information Theory, 68(2):1157–1177, 2021

2021

[12] [12]

Academic press, 2013

Morris W Hirsch, Stephen Smale, and Robert L Devaney.Differential equations, dynamical systems, and an introduction to chaos. Academic press, 2013

2013

[13] [13]

On the sample complexity of parameter estimation in logistic regression with normal design

Daniel Hsu and Arya Mazumdar. On the sample complexity of parameter estimation in logistic regression with normal design. InThe Thirty Seventh Annual Conference on Learning Theory, pages 2418–2437. PMLR, 2024

2024

[14] [14]

Robust 1- bit compressive sensing via binary stable embeddings of sparse vectors.IEEE Transactions on Information Theory, 59(4):2082–2102, 2013

Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1- bit compressive sensing via binary stable embeddings of sparse vectors.IEEE Transactions on Information Theory, 59(4):2082–2102, 2013

2082

[15] [15]

The implicit bias of gradient descent on nonseparable data

Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on learning theory, pages 1772–1798. PMLR, 2019

2019

[16] [16]

Efficient learning of gener- alized linear and single index models with isotonic regression.Advances in Neural Information Processing Systems, 24, 2011

Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of gener- alized linear and single index models with isotonic regression.Advances in Neural Information Processing Systems, 24, 2011

2011

[17] [17]

The isotron algorithm: High-dimensional isotonic regres- sion

Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regres- sion. InCOLT, volume 1, page 9, 2009

2009

[18] [18]

Finite sample rates for logistic regression with small noise or few samples.Sankhya A, pages 1–70, 2024

Felix Kuchelmeister and Sara van de Geer. Finite sample rates for logistic regression with small noise or few samples.Sankhya A, pages 1–70, 2024

2024

[19] [19]

Springer

Michel Ledoux and Michel Talagrand.Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer

[20] [20]

On the sample complexity of pac learning half-spaces against the uniform distribution.IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995

Philip M Long. On the sample complexity of pac learning half-spaces against the uniform distribution.IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995

1995

[21] [21]

Robust multivariate mean estimation: The optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021

Gábor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: The optimality of trimmed mean.The Annals of Statistics, 49(1):393–410, 2021. 12

2021

[22] [22]

Binary iterative hard thresholding converges with optimal number of measurements for 1-bit compressed sensing.Journal of the ACM, 71(5):1–64, 2024

Namiko Matsumoto and Arya Mazumdar. Binary iterative hard thresholding converges with optimal number of measurements for 1-bit compressed sensing.Journal of the ACM, 71(5):1–64, 2024

2024

[23] [23]

Learning sparse generalized linear models with binary outcomes via iterative hard thresholding

Namiko Matsumoto and Arya Mazumdar. Learning sparse generalized linear models with binary outcomes via iterative hard thresholding. InProceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 3933–4032. PMLR, 30 Jun–04 Jul 2025

2025

[24] [24]

Nelder.Generalized linear models

Peter McCullagh and John A. Nelder.Generalized linear models. Routledge, 2019

2019

[25] [25]

Gradient descent on logistic regression: Do large step-sizes work with data on the sphere?arXiv preprint arXiv:2507.11228, 2025

Si Yi Meng, Baptiste Goujaud, Antonio Orvieto, and Christopher De Sa. Gradient descent on logistic regression: Do large step-sizes work with data on the sphere?arXiv preprint arXiv:2507.11228, 2025

arXiv 2025

[26] [26]

Gradient descent on logistic regression with non-separable data and large step sizes.arXiv preprint arXiv:2406.05033, 2024

Si Yi Meng, Antonio Orvieto, Daniel Yiming Cao, and Christopher De Sa. Gradient descent on logistic regression with non-separable data and large step sizes.arXiv preprint arXiv:2406.05033, 2024

arXiv 2024

[27] [27]

Convergence of gradient descent on separable data

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428. PMLR, 2019

2019

[28] [28]

Springer, 2018

Yurii Nesterov et al.Lectures on convex optimization, volume 137. Springer, 2018

2018

[29] [29]

Phase retrieval using alternating mini- mization.Advances in Neural Information Processing Systems, 26, 2013

Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating mini- mization.Advances in Neural Information Processing Systems, 26, 2013

2013

[30] [30]

High-dimensionalestimationwithgeometric constraints.Information and Inference: A Journal of the IMA, 6(1):1–40, 2017

YanivPlan, RomanVershynin, andElenaYudovina. High-dimensionalestimationwithgeometric constraints.Information and Inference: A Journal of the IMA, 6(1):1–40, 2017

2017

[31] [31]

Onpaclearningusingwinnow, perceptron, andaperceptron-likealgorithm

RoccoAServedio. Onpaclearningusingwinnow, perceptron, andaperceptron-likealgorithm. In Proceedings of the Twelfth Annual Conference on Computational learning theory, pages 296–307, 1999

1999

[32] [32]

The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70):1–57, 2018

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data.Journal of Machine Learning Research, 19(70):1–57, 2018

2018

[33] [33]

A bound for the error in the normal approximation to the distribution of a sum of dependent random variables

Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. InProceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 2: Probability theory, volume 6, pages 583–603. University of California Press, 1972

1972

[34] [34]

Use of exchangeable pairs in the analysis of simulations.Lecture Notes-Monograph Series, pages 1–26, 2004

Charles Stein, Persi Diaconis, Susan Holmes, and Gesine Reinert. Use of exchangeable pairs in the analysis of simulations.Lecture Notes-Monograph Series, pages 1–26, 2004

2004

[35] [35]

From logistic regression to the perceptron algorithm: Exploring gradient descent with large step sizes

Alexander Tyurin. From logistic regression to the perceptron algorithm: Exploring gradient descent with large step sizes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20938–20946, 2025. 13

2025

[36] [36]

Cambridge University Press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data sci- ence, volume 47. Cambridge University Press, 2018

2018

[37] [37]

Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency

Jingfeng Wu, Peter L Bartlett, Matus Telgarsky, and Bin Yu. Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency. InThe Thirty Seventh Annual Conference on Learning Theory, pages 5019–5073. PMLR, 2024

2024

[38] [38]

Implicit bias of gradient descent for logistic regression at the edge of stability.Advances in Neural Information Processing Systems, 36:74229–74256, 2023

Jingfeng Wu, Vladimir Braverman, and Jason D Lee. Implicit bias of gradient descent for logistic regression at the edge of stability.Advances in Neural Information Processing Systems, 36:74229–74256, 2023

2023

[39] [39]

Large stepsizes accelerate gradient descent for regularized logistic regression.arXiv preprint arXiv:2506.02336, 2025

Jingfeng Wu, Pierre Marion, and Peter Bartlett. Large stepsizes accelerate gradient descent for regularized logistic regression.arXiv preprint arXiv:2506.02336, 2025

arXiv 2025

[40] [40]

Gradient descent converges arbitrarily fast for logistic regression via large and adaptive stepsizes

Ruiqi Zhang, Jingfeng Wu, and Peter Bartlett. Gradient descent converges arbitrarily fast for logistic regression via large and adaptive stepsizes. InForty-second International Conference on Machine Learning, 2025

2025

[41] [41]

localization

Ruiqi Zhang, Jingfeng Wu, Licong Lin, and Peter L Bartlett. Minimax optimal convergence of gradient descent in logistic regression via large and adaptive stepsizes.arXiv preprint arXiv:2504.04105, 2025. A Proof of Theorem 1 (GD with Small Stepsize) The main ingredient is an approximate invertibility condition (AIC) that controls ∥u−θ ∗ −η·h θ∗(u)∥2,∀u∈B d...

arXiv 2025

[42] [42]

C.1 Bounding| ˆλ− ∥θ ∗∥2|(norm estimation error) In order to bound|ˆλ− ∥θ ∗∥2|= q−1(ˆv⊤ˆr)− ∥θ∗∥2 ,we shall start with bounding ˆv⊤ˆr−q(∥θ∗∥2)

In fact, by assuming the bound in Lemma 1 holds (this is within the probability promised in Theorem 3), then we have ∥θ∗∥2 ˆr− θ∗ ∥θ∗∥2 2 ≲polylog(n) r ∥θ∗∥2d n + ∥θ∗∥2d n .(70) In the remainder of this appendix, we focus on bounding the norm estimation error,|ˆλ− ∥θ ∗∥2|. C.1 Bounding| ˆλ− ∥θ ∗∥2|(norm estimation error) In order to bound|ˆλ− ∥θ ∗∥2|= q−1...