Rethinking Neural Network Learning Rates: A Stackelberg Perspective

arxiv: 2605.15530 · v1 · pith:Q7RP75NJ · submitted 2026-05-15 · 💻 cs.LG

Sihan Zeng, Sujay Bhatt, Sumitra Ganesh

Pith reviewed 2026-05-19 14:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords: neural network training · learning rates · Stackelberg optimization · two-time-scale gradient descent · convergence guarantees · non-uniform learning rates · supervised learning · reinforcement learning

The pith

Assigning a smaller learning rate to body layers and a larger learning rate to the final layer is equivalent to two-time-scale alternating gradient descent on a Stackelberg reformulation of neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that non-uniform learning rates with smaller values for body layers and a larger value for the final layer amount to running two-time-scale alternating gradient descent on a Stackelberg game reformulation of the training objective. In this reformulation the final layer acts as leader and optimizes an objective defined on the best responses of the earlier layers. Finite-time convergence guarantees are proved for the algorithm under conditions that include constraint sets and non-smooth activations. The authors identify two mechanisms for outperformance: the Stackelberg objective can possess stronger global optimization structure than the original loss, and it can exhibit substantially sharper local curvature early in training that supplies more informative gradients. Experiments on supervised learning and reinforcement learning tasks support both the convergence claims and the practical speed-ups.

Core claim

Reformulating neural network training as a Stackelberg game with the final layer as leader turns the non-uniform learning rate schedule into a two-time-scale alternating gradient descent procedure. Finite-time convergence holds under broad conditions that accommodate constraints and non-smooth activations. On some problems the Stackelberg objective supplies stronger optimization structure than the original objective, while numerical analysis shows it produces substantially sharper local curvature especially in early training and therefore more informative gradients.
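The schedule described above can be sketched on a toy network. The following is a minimal illustration, not the paper's algorithm or code: it assumes a one-hidden-layer model with tanh features (body weights `M`) and a linear read-out (head weights `w`), a squared loss, and an update order of body first, then head. The names `forward`, `grads`, and `alternating_step`, the data, and the step sizes are all hypothetical choices made for the example; only the convention of a smaller rate `alpha` for the body and a larger rate `beta` for the head comes from the text.

```python
import math

def forward(M, w, x):
    """Body: tanh features with weights M. Head: linear read-out with weights w."""
    h = [math.tanh(Mj * x) for Mj in M]
    return h, sum(wj * hj for wj, hj in zip(w, h))

def loss(M, w, data):
    """Mean squared-error loss (with a factor 1/2) over the dataset."""
    return sum(0.5 * (forward(M, w, x)[1] - y) ** 2 for x, y in data) / len(data)

def grads(M, w, data):
    """Gradients of the loss with respect to body weights M and head weights w."""
    gM = [0.0] * len(M)
    gw = [0.0] * len(w)
    for x, y in data:
        h, pred = forward(M, w, x)
        r = pred - y  # residual
        for j in range(len(M)):
            gw[j] += r * h[j] / len(data)
            gM[j] += r * w[j] * (1.0 - h[j] ** 2) * x / len(data)
    return gM, gw

def alternating_step(M, w, data, alpha, beta):
    """One alternating step: body with the small rate alpha, then head with the
    larger rate beta, evaluated at the updated body (order is an assumption)."""
    gM, _ = grads(M, w, data)
    M = [Mj - alpha * g for Mj, g in zip(M, gM)]
    _, gw = grads(M, w, data)
    w = [wj - beta * g for wj, g in zip(w, gw)]
    return M, w

# Hypothetical regression data: y = sin(x) on a small grid.
data = [(x / 4.0, math.sin(x / 4.0)) for x in range(-8, 9)]
M, w = [0.5, -0.3, 0.8], [0.1, 0.1, 0.1]
initial = loss(M, w, data)
for _ in range(500):
    M, w = alternating_step(M, w, data, alpha=0.01, beta=0.1)
final = loss(M, w, data)
print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

Setting `alpha == beta` in `alternating_step` gives a uniform-rate baseline with the same update order, which is one simple way to compare the two schedules on equal footing.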

What carries the argument

Stackelberg reformulation of the training objective with the final layer as leader whose objective depends on the best response of the body layers.
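For orientation, a Stackelberg (bilevel) reformulation of a joint minimization problem is usually written as below. This is the generic textbook form under assumed notation, not a transcription of the paper's definitions: $x$ stands for the leader's parameters, $y$ for the parameters of the best-responding block, and $f$ for the original training objective over both blocks.

```latex
\begin{aligned}
\text{original problem:} \quad & \min_{x,\,y}\; f(x, y), \\
\text{leader:} \quad & \min_{x}\; \Phi(x) \triangleq f\bigl(x,\, y^\star(x)\bigr), \\
\text{best response:} \quad & y^\star(x) \in \arg\min_{y}\; f(x, y).
\end{aligned}
```

The leader's objective $\Phi$ is the quantity whose optimization structure and curvature are compared against those of $f$ in the claims above.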

Load-bearing premise

Modeling layer interactions in neural network training as a Stackelberg game with the final layer as leader preserves the original optimization landscape well enough for convergence and curvature results to apply back to standard training.

What would settle it

Train identical networks with uniform and non-uniform rates on a low-dimensional convex problem whose global minimum is known, then verify whether the non-uniform schedule reaches that minimum at the rate predicted by the sharper-curvature analysis.
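A test of this kind can be prototyped before involving any network. The sketch below is a hypothetical stand-in, not an experiment from the paper: it uses a two-variable convex quadratic whose global minimum is known to be at `(m, w) = (1, 2)` with value 0, treats `m` as the "body" variable and `w` as the "head" variable, and runs the same alternating update under three schedules. The function `loss`, the helper `run`, and the values of `ALPHA` and `BETA` are assumptions chosen for illustration.

```python
def loss(m, w):
    """Convex quadratic with known global minimum 0 at (m, w) = (1, 2)."""
    return 0.5 * (m - 1.0) ** 2 + 0.5 * (w - 2.0) ** 2 + 0.5 * (m - w + 1.0) ** 2

def run(alpha, beta, steps=200, m=0.0, w=0.0):
    """Alternating gradient descent: m with rate alpha, then w with rate beta."""
    history = [loss(m, w)]
    for _ in range(steps):
        m -= alpha * (2.0 * m - w)        # d loss / d m
        w -= beta * (2.0 * w - m - 3.0)   # d loss / d w, at the updated m
        history.append(loss(m, w))
    return m, w, history

ALPHA, BETA = 0.05, 0.3  # hypothetical small and large rates

runs = {
    "uniform_small": run(ALPHA, ALPHA),
    "uniform_large": run(BETA, BETA),
    "non_uniform": run(ALPHA, BETA),
}
for name, (m, w, hist) in runs.items():
    print(f"{name:14s} m={m:.6f} w={w:.6f} final_loss={hist[-1]:.3e}")
```

Each `history` list records the loss per iteration, so the empirical decay rate of every schedule can be read off and compared with whatever rate a curvature analysis predicts. A quadratic this simple says nothing about neural networks; it only fixes the measurement protocol.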

Figures

Figures reproduced from arXiv: 2605.15530 by Sihan Zeng, Sujay Bhatt, Sumitra Ganesh.

**Figure 1.** Optimization landscape and contour. The Stackelberg objective (bottom row) consistently shows a strong gradient direction away from the origin and a sharper curvature (measured by the largest eigenvalue and trace of the Hessian) when the iterates are far from convergence (iterations 100 and earlier according to …

**Figure 2.** Objective function convergence under single learning rate and non-uniform learning rates, using the same training trajectories as those for generating …

**Figure 3.** Non-uniform learning rates for regression. [Plot panels: MNIST, FASHION, CIFAR; x-axis: Epoch; y-axis: Classification Accuracy; curves: Uniform Small, Uniform Large, Non-Uniform]

**Figure 5.** Non-uniform learning rates for policy optimization in Atari games. “Non-Uniform” indicates that we use learning rate α for the body and a larger rate β for the head. “Uniform Small” and “Uniform Large” use a single learning rate equal to α and β, respectively. For a fixed M, let Π_M denote the orthogonal projection onto Ψ_M, which is the representation space induced by M, with Ψ_M the matrix whose rows are ψ_M(s_1)^⊤, …

Original abstract

Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond convergence, we identify two mechanisms by which non-uniform learning rates can outperform uniform learning rates: (i) we show that certain problem instances induce a Stackelberg objective with stronger optimization structure than the original objective, yielding faster convergence to globally optimal solutions, (ii) our numerical analysis reveals that the Stackelberg objective can exhibit substantially sharper local curvature, especially in early training, which leads to more informative gradients and learning acceleration. Experiments in supervised learning and reinforcement learning support our findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Stackelberg framing gives finite-time convergence for non-uniform layer rates under broad conditions, but the curvature and acceleration claims hinge on how well the reformulated objective matches standard training dynamics.

The paper recasts smaller learning rates on body layers and larger ones on the final layer as two-time-scale alternating gradient descent on a Stackelberg game, with the last layer as leader. The authors prove finite-time convergence for this setup under conditions that include constraint sets and non-smooth activations. That is the main new piece: an explicit game-theoretic structure plus two claimed mechanisms for why it can beat uniform rates. The first is stronger global optimization properties on some instances; the second is sharper local curvature early in training, which yields more informative gradients.

Referee Report

2 major / 1 minor

Summary. The manuscript reinterprets the empirical practice of using smaller learning rates for body layers and a larger learning rate for the final layer in neural network training as a two-time-scale alternating gradient descent procedure applied to a Stackelberg reformulation of the original empirical risk objective, with the final layer as leader. It claims finite-time convergence guarantees for this algorithm under broad conditions that include constraint sets and non-smooth activation functions. Two mechanisms are identified for potential outperformance over uniform learning rates: (i) certain problem instances yield a Stackelberg objective with stronger optimization structure that enables faster convergence to global optima, and (ii) the Stackelberg objective exhibits substantially sharper local curvature (especially early in training), producing more informative gradients. Experiments in supervised learning and reinforcement learning are presented in support.

Significance. If the finite-time convergence results hold rigorously and the identified mechanisms are shown to transfer meaningfully to standard neural network training, the work could provide a principled game-theoretic lens for selecting layer-specific learning rates. The explicit accommodation of non-smooth activations and constraints in the convergence analysis is a strength relative to many existing bilevel or two-time-scale analyses in the literature.

major comments (2)

The curvature-acceleration mechanism (abstract and numerical analysis) asserts that the Stackelberg objective exhibits substantially sharper local curvature than the original loss, leading to acceleration. However, because the best-response map is set-valued and non-differentiable for common non-smooth activations such as ReLU, the effective Hessian or gradient of the leader's objective is not obviously well-defined or directly comparable to the geometry of the original empirical risk; this undermines transfer of the curvature claims back to standard training dynamics.
The finite-time convergence guarantees are stated to hold under broad conditions that accommodate constraints and non-smooth activations. The manuscript should clarify (in the convergence section) whether these guarantees apply exactly to the practical implementation of layer-specific learning rates or require additional regularity conditions on the best-response map that may not be satisfied in typical neural network settings; this is load-bearing for the claim that the reformulation explains observed benefits.

minor comments (1)

Notation distinguishing the Stackelberg leader objective from standard bilevel optimization should be made more explicit to avoid potential confusion with existing literature on hyperparameter optimization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and indicate where revisions will be made to improve clarity without altering the core claims.

Referee: The curvature-acceleration mechanism (abstract and numerical analysis) asserts that the Stackelberg objective exhibits substantially sharper local curvature than the original loss, leading to acceleration. However, because the best-response map is set-valued and non-differentiable for common non-smooth activations such as ReLU, the effective Hessian or gradient of the leader's objective is not obviously well-defined or directly comparable to the geometry of the original empirical risk; this undermines transfer of the curvature claims back to standard training dynamics.

Authors: We agree that the set-valued and non-differentiable nature of the best-response map for non-smooth activations such as ReLU means that a classical Hessian of the leader's objective is not well-defined. Our curvature analysis relies on numerical evaluation of the effective objective (via finite differences on sampled trajectories) rather than analytical Hessian comparison. We will revise the abstract and numerical analysis sections to explicitly state that the sharper-curvature observation is an empirical finding in the implemented dynamics and does not rest on differentiability of the best-response map. This preserves the reported acceleration while acknowledging the limitation on analytical transfer to non-smooth cases.
Revision: partial
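A finite-difference curvature estimate of the kind mentioned in this response can be sketched as follows. This is an illustrative assumption about how such a measurement might look, not the authors' procedure: `fd_hessian` builds a central-difference Hessian of any scalar function of a parameter list, and `top_eigenvalue` runs power iteration, which returns the eigenvalue of largest magnitude (for an indefinite Hessian that may be a negative one). Both helper names, the step `eps`, and the test function `toy_loss` are hypothetical.

```python
import math

def fd_hessian(f, theta, eps=1e-4):
    """Central-difference estimate of the Hessian of f at the point theta."""
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]

    def shifted(i, di, j, dj):
        p = list(theta)
        p[i] += di
        p[j] += dj
        return f(p)

    for i in range(n):
        for j in range(i, n):
            val = (shifted(i, eps, j, eps) - shifted(i, eps, j, -eps)
                   - shifted(i, -eps, j, eps) + shifted(i, -eps, j, -eps)) / (4.0 * eps * eps)
            H[i][j] = H[j][i] = val  # fill both triangles so H is symmetric
    return H

def top_eigenvalue(H, iters=200):
    """Power iteration with a Rayleigh quotient; dominant eigenvalue by magnitude."""
    n = len(H)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed, non-constant starting vector
    lam = 0.0
    for _ in range(iters):
        Hv = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in Hv))
        if norm == 0.0:
            return 0.0
        lam = sum(a * b for a, b in zip(v, Hv)) / sum(a * a for a in v)
        v = [x / norm for x in Hv]
    return lam

def toy_loss(p):
    """Quadratic test function whose exact Hessian is [[2, -1], [-1, 2]]."""
    m, w = p
    return 0.5 * (m - 1.0) ** 2 + 0.5 * (w - 2.0) ** 2 + 0.5 * (m - w + 1.0) ** 2

H = fd_hessian(toy_loss, [0.3, -0.7])
trace = sum(H[i][i] for i in range(len(H)))
lam_max = top_eigenvalue(H)
print(f"trace = {trace:.4f}, dominant eigenvalue = {lam_max:.4f}")
```

Because it only evaluates the function, this estimate stays computable for non-smooth activations such as ReLU, but near a kink its value depends on `eps` and should be read as an empirical curvature proxy rather than a classical Hessian.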
Referee: The finite-time convergence guarantees are stated to hold under broad conditions that accommodate constraints and non-smooth activations. The manuscript should clarify (in the convergence section) whether these guarantees apply exactly to the practical implementation of layer-specific learning rates or require additional regularity conditions on the best-response map that may not be satisfied in typical neural network settings; this is load-bearing for the claim that the reformulation explains observed benefits.

Authors: The finite-time convergence results apply to the two-time-scale alternating gradient descent procedure on the Stackelberg objective under the stated assumptions, which already accommodate non-unique best responses and non-smooth activations through appropriate measurable selections; no additional regularity (such as Lipschitz continuity of the best-response) is imposed. The practical layer-specific learning rates correspond to a single inner gradient step approximation of this procedure. We will add a clarifying paragraph in the convergence section that distinguishes the idealized algorithm from its practical approximation and notes that the guarantees do not rely on conditions beyond those already listed.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity: Stackelberg reformulation and convergence analysis are independent of inputs

The paper frames non-uniform learning rates as an interpretation via two-time-scale alternating GD on a Stackelberg reformulation of the original objective. Finite-time convergence is then derived for this algorithm under general conditions (constraints, non-smooth activations). Mechanisms such as stronger optimization structure and sharper curvature are analyzed on the reformulated objective. No quoted step reduces a claimed prediction or first-principles result to a fitted parameter, self-definition, or load-bearing self-citation chain; the reformulation is introduced as a modeling lens rather than derived from the target claims. The derivation chain remains self-contained against external alternating-GD benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard assumptions from non-smooth optimization and two-time-scale analysis; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

standard math: Finite-time convergence of two-time-scale alternating gradient descent holds under the stated conditions on constraint sets and non-smooth activations. Invoked when establishing the convergence guarantees for the Stackelberg algorithm.

pith-pipeline@v0.9.0 · 5743 in / 1270 out tokens · 47442 ms · 2026-05-19T14:37:19.333466+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: we establish finite-time convergence guarantees ... under ... non-smooth activation functions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

