Rethinking Neural Network Learning Rates: A Stackelberg Perspective
Pith reviewed 2026-05-19 14:37 UTC · model grok-4.3
The pith
Assigning a smaller learning rate to body layers and a larger learning rate to the final layer is equivalent to two-time-scale alternating gradient descent on a Stackelberg reformulation of neural network training.
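As a hedged sketch of the update this sentence describes, write M for the body-layer weights, w for the final-layer weights, f(M, w) for the training objective, and \alpha_k \le \beta_k for the body and final-layer step sizes. These symbols are notation assumed here for illustration. Constraint-set projections and stochastic gradients are omitted, and whether the second update uses M_k or M_{k+1} is a design choice the text above does not fix; the sketch uses the alternating form.

```latex
\begin{align*}
M_{k+1} &= M_k - \alpha_k \nabla_M f(M_k, w_k), \\
w_{k+1} &= w_k - \beta_k \nabla_w f(M_{k+1}, w_k), \qquad \alpha_k \le \beta_k .
\end{align*}
```

Setting \alpha_k = \beta_k recovers alternating gradient descent with a uniform learning rate, so the non-uniform schedule differs only in the ratio of the two step sizes.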
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reformulating neural network training as a Stackelberg game with the final layer as leader turns the non-uniform learning rate schedule into a two-time-scale alternating gradient descent procedure. Finite-time convergence holds under broad conditions that accommodate constraints and non-smooth activations. On some problems the Stackelberg objective supplies stronger optimization structure than the original objective, and numerical analysis indicates that it can exhibit substantially sharper local curvature, especially in early training, which yields more informative gradients.
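One hedged way to probe a curvature claim of this kind numerically is a central second difference along a chosen direction, which needs only function values. The sketch below is an illustration, not the paper's procedure: the scalar loss, the ridge weight LAM, and the helper names are all hypothetical, and the reduced objective is obtained by plugging in the closed-form minimizer over the final-layer weight. For non-smooth activations such as ReLU the estimate would depend on the step h.

```python
def directional_curvature(f, x, d, h=1e-3):
    """Central second difference of f at point x along direction d."""
    plus = [xi + h * di for xi, di in zip(x, d)]
    minus = [xi - h * di for xi, di in zip(x, d)]
    return (f(plus) - 2.0 * f(x) + f(minus)) / (h * h)

LAM = 0.1  # hypothetical ridge weight; makes the minimizer over w unique

def loss(m, w):
    # toy objective: body weight m, final-layer weight w
    return 0.5 * (m * w - 1.0) ** 2 + 0.5 * LAM * w ** 2

def best_head(m):
    # closed-form minimizer of loss(m, .) over w for this toy objective
    return m / (m * m + LAM)

def reduced_loss(m):
    # objective seen by the body once the final layer is at its minimizer
    return loss(m, best_head(m))

m0, w0 = 0.5, 0.5
# curvature in m with the final layer frozen at w0
curv_original = directional_curvature(lambda x: loss(x[0], w0), [m0], [1.0])
# curvature in m of the reduced objective
curv_reduced = directional_curvature(lambda x: reduced_loss(x[0]), [m0], [1.0])
print(curv_original, curv_reduced)
```

The two numbers compare curvature only at one point of one toy problem; they say nothing about real networks, where the comparison would have to be repeated along sampled training trajectories.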
What carries the argument
Stackelberg reformulation of the training objective with the final layer as leader whose objective depends on the best response of the body layers.
Load-bearing premise
Modeling layer interactions in neural network training as a Stackelberg game with the final layer as leader preserves the original optimization landscape well enough for convergence and curvature results to apply back to standard training.
What would settle it
Train identical networks with uniform and non-uniform rates on a low-dimensional convex problem whose global minimum is known, then verify whether the non-uniform schedule reaches that minimum at the rate predicted by the sharper-curvature analysis.
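A minimal sketch of such a comparison, under loud assumptions: the "network" is a hypothetical scalar model with one body weight m and one final-layer weight w, the learning rates and starting point are arbitrary, and the toy loss is not jointly convex in (m, w), although its global minimum value of 0 is known. It shows only the shape of the experiment, not the paper's setup.

```python
def loss(m, w):
    # scalar stand-in for a network: body weight m, final-layer weight w
    return 0.5 * (m * w - 1.0) ** 2

def train(lr_body, lr_head, steps=500, m=0.5, w=0.5):
    """Alternating gradient descent; returns the loss after every step."""
    history = [loss(m, w)]
    for _ in range(steps):
        m -= lr_body * (m * w - 1.0) * w  # body step
        w -= lr_head * (m * w - 1.0) * m  # final-layer step uses the updated m
        history.append(loss(m, w))
    return history

def steps_to(history, tol=1e-8):
    """First step index at which the loss drops below tol, or None."""
    return next((k for k, v in enumerate(history) if v < tol), None)

uniform = train(lr_body=0.1, lr_head=0.1)
nonuniform = train(lr_body=0.05, lr_head=0.5)
print(steps_to(uniform), steps_to(nonuniform))
```

Comparing the two step counts against a predicted rate would require the rate constants from the paper's analysis, which are not reproduced here.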
Original abstract
Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond convergence, we identify two mechanisms by which non-uniform learning rates can outperform uniform learning rates: (i) we show that certain problem instances induce a Stackelberg objective with stronger optimization structure than the original objective, yielding faster convergence to globally optimal solutions, (ii) our numerical analysis reveals that the Stackelberg objective can exhibit substantially sharper local curvature, especially in early training, which leads to more informative gradients and learning acceleration. Experiments in supervised learning and reinforcement learning support our findings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reinterprets the empirical practice of using smaller learning rates for body layers and a larger learning rate for the final layer in neural network training as a two-time-scale alternating gradient descent procedure applied to a Stackelberg reformulation of the original empirical risk objective, with the final layer as leader. It claims finite-time convergence guarantees for this algorithm under broad conditions that include constraint sets and non-smooth activation functions. Two mechanisms are identified for potential outperformance over uniform learning rates: (i) certain problem instances yield a Stackelberg objective with stronger optimization structure that enables faster convergence to global optima, and (ii) the Stackelberg objective exhibits substantially sharper local curvature (especially early in training), producing more informative gradients. Experiments in supervised learning and reinforcement learning are presented in support.
Significance. If the finite-time convergence results hold rigorously and the identified mechanisms are shown to transfer meaningfully to standard neural network training, the work could provide a principled game-theoretic lens for selecting layer-specific learning rates. The explicit accommodation of non-smooth activations and constraints in the convergence analysis is a strength relative to many existing bilevel or two-time-scale analyses in the literature.
major comments (2)
- The curvature-acceleration mechanism (abstract and numerical analysis) asserts that the Stackelberg objective exhibits substantially sharper local curvature than the original loss, leading to acceleration. However, because the best-response map is set-valued and non-differentiable for common non-smooth activations such as ReLU, the effective Hessian or gradient of the leader's objective is not obviously well-defined or directly comparable to the geometry of the original empirical risk; this undermines transfer of the curvature claims back to standard training dynamics.
- The finite-time convergence guarantees are stated to hold under broad conditions that accommodate constraints and non-smooth activations. The manuscript should clarify (in the convergence section) whether these guarantees apply exactly to the practical implementation of layer-specific learning rates or require additional regularity conditions on the best-response map that may not be satisfied in typical neural network settings; this is load-bearing for the claim that the reformulation explains observed benefits.
minor comments (1)
- Notation distinguishing the Stackelberg leader objective from standard bilevel optimization should be made more explicit to avoid potential confusion with existing literature on hyperparameter optimization.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and indicate where revisions will be made to improve clarity without altering the core claims.
- Referee: The curvature-acceleration mechanism (abstract and numerical analysis) asserts that the Stackelberg objective exhibits substantially sharper local curvature than the original loss, leading to acceleration. However, because the best-response map is set-valued and non-differentiable for common non-smooth activations such as ReLU, the effective Hessian or gradient of the leader's objective is not obviously well-defined or directly comparable to the geometry of the original empirical risk; this undermines transfer of the curvature claims back to standard training dynamics.
  Authors: We agree that the set-valued and non-differentiable nature of the best-response map for non-smooth activations such as ReLU means that a classical Hessian of the leader's objective is not well-defined. Our curvature analysis relies on numerical evaluation of the effective objective (via finite differences on sampled trajectories) rather than analytical Hessian comparison. We will revise the abstract and numerical analysis sections to explicitly state that the sharper-curvature observation is an empirical finding in the implemented dynamics and does not rest on differentiability of the best-response map. This preserves the reported acceleration while acknowledging the limitation on analytical transfer to non-smooth cases.
  Revision: partial
- Referee: The finite-time convergence guarantees are stated to hold under broad conditions that accommodate constraints and non-smooth activations. The manuscript should clarify (in the convergence section) whether these guarantees apply exactly to the practical implementation of layer-specific learning rates or require additional regularity conditions on the best-response map that may not be satisfied in typical neural network settings; this is load-bearing for the claim that the reformulation explains observed benefits.
  Authors: The finite-time convergence results apply to the two-time-scale alternating gradient descent procedure on the Stackelberg objective under the stated assumptions, which already accommodate non-unique best responses and non-smooth activations through appropriate measurable selections; no additional regularity (such as Lipschitz continuity of the best-response) is imposed. The practical layer-specific learning rates correspond to a single inner gradient step approximation of this procedure. We will add a clarifying paragraph in the convergence section that distinguishes the idealized algorithm from its practical approximation and notes that the guarantees do not rely on conditions beyond those already listed.
  Revision: yes
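The distinction drawn in the second response, between a procedure and its single-inner-gradient-step approximation, can be illustrated on a toy objective. This is a sketch under stated assumptions, not the paper's algorithm: the variable with the larger learning rate (the final-layer weight w) is assumed to take the inner steps, the scalar objective and ridge weight LAM are hypothetical, and "idealized" is approximated by simply running many inner steps.

```python
LAM = 0.1  # hypothetical ridge weight so the minimizer over w is unique

def grad_m(m, w):
    # gradient of 0.5*(m*w - 1)^2 + 0.5*LAM*w^2 with respect to m
    return (m * w - 1.0) * w

def grad_w(m, w):
    # gradient of the same objective with respect to w
    return (m * w - 1.0) * m + LAM * w

def outer_step(m, w, alpha, beta, inner_steps):
    """One slow body step followed by `inner_steps` fast final-layer steps."""
    m = m - alpha * grad_m(m, w)
    for _ in range(inner_steps):
        w = w - beta * grad_w(m, w)
    return m, w

def tracking_gap(inner_steps, outer_steps=50, alpha=0.05, beta=0.5):
    """Distance between w and the exact minimizer over w for the final m."""
    m, w = 0.5, 0.5
    for _ in range(outer_steps):
        m, w = outer_step(m, w, alpha, beta, inner_steps)
    return abs(w - m / (m * m + LAM))

# inner_steps=1 mimics plain layer-specific learning rates;
# inner_steps=200 approximates solving the inner problem at every step
print(tracking_gap(inner_steps=1), tracking_gap(inner_steps=200))
```

With a single inner step the fast variable only tracks its minimizer with some lag, which is the gap a convergence analysis of the practical scheme has to control.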
Circularity Check
No significant circularity: Stackelberg reformulation and convergence analysis are independent of inputs
The paper frames non-uniform learning rates as an interpretation via two-time-scale alternating GD on a Stackelberg reformulation of the original objective. Finite-time convergence is then derived for this algorithm under general conditions (constraints, non-smooth activations). Mechanisms such as stronger optimization structure and sharper curvature are analyzed on the reformulated objective. No quoted step reduces a claimed prediction or first-principles result to a fitted parameter, self-definition, or load-bearing self-citation chain; the reformulation is introduced as a modeling lens rather than derived from the target claims. The derivation chain remains self-contained against external alternating-GD benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Finite-time convergence of two-time-scale alternating gradient descent holds under the stated conditions on constraint sets and non-smooth activations.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "we establish finite-time convergence guarantees ... under ... non-smooth activation functions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.