arxiv: 2605.09209 · v1 · submitted 2026-05-09 · 🧮 math.OC · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

Negar Kiyavash, Niao He, Saeed Masiha, Zebang Shen

Pith reviewed 2026-05-12 02:59 UTC · model grok-4.3

classification 🧮 math.OC cs.AI

keywords bilevel optimizationmanifold minimizershypergradient computationPolyak-Lojasiewicz conditionoptimistic bilevelpseudoinverseselect-then-differentiateconvergence analysis

0 comments

The pith

Uniqueness of the optimistic selection suffices for differentiability of the hyper-objective in bilevel optimization with manifold lower-level solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses optimistic bilevel optimization where the lower-level problem has a non-isolated manifold of minimizers instead of a unique solution. It proves that under a local Polyak-Łojasiewicz condition, differentiability of the hyper-objective holds as long as the optimistic selection among those minimizers is unique. This leads to an explicit hyper-gradient formula using the pseudoinverse that extends previous results limited to singleton minimizers. The work also characterizes the conditions for local smoothness and proposes the HG-MS algorithm that selects the optimistic solution first and then differentiates, with convergence depending on the manifold's intrinsic dimension rather than the ambient one.

Core claim

Under a local Polyak-Łojasiewicz condition, the hyper-objective in optimistic bilevel optimization is differentiable whenever the optimistic selection from the lower-level solution manifold is unique. This yields an explicit pseudoinverse-based hyper-gradient formula. Non-degeneracy of the selected minimizer ensures local smoothness, while non-uniqueness can introduce non-differentiable points and non-degeneracy can eliminate Hölder regularity.

What carries the argument

The unique optimistic selection from the manifold of lower-level minimizers, which permits computation of the hyper-gradient via the pseudoinverse of the Jacobian of the selected point's stationarity conditions.

If this is right

The hyper-objective is locally smooth if the selected minimizer is non-degenerate.
Non-uniqueness of the optimistic selection creates multiple points where the hyper-objective is non-differentiable.
Failure of non-degeneracy destroys all positive Hölder regularity of the hyper-gradient.
HG-MS converges to a stationary point of the optimistic objective with iteration complexity governed by the intrinsic dimension of the solution manifold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framework could apply to bilevel problems in machine learning involving symmetries or multiple equivalent solutions at the lower level.
Numerical verification of the hyper-gradient formula on low-dimensional manifold examples would provide a direct test.
The select-then-differentiate idea might combine with manifold optimization algorithms for further speedups in high-dimensional settings.
In the LLM reweighting application, it suggests robustness to cases where source models have equivalent performance optima.

Load-bearing premise

The lower-level objective satisfies a local Polyak-Łojasiewicz condition and admits a unique non-degenerate optimistic selection from its manifold of minimizers.

What would settle it

A concrete counterexample lower-level function with a line of minimizers where two points tie for the optimistic upper-level value, showing the hyper-objective has a kink or non-differentiable point at the corresponding upper-level input.

Figures

Figures reproduced from arXiv: 2605.09209 by Negar Kiyavash, Niao He, Saeed Masiha, Zebang Shen.

**Figure 1.** Figure 1: Illustration of Example 3.1. Left: selection switches between two endpoints of S(θ) according to the sign of a(θ). Right: F(θ) = −r(θ)|a(θ)| is obtained by taking the smaller value between the two branches ±r(θ)a(θ), creating kinks at ties a(θ) = 0 (black dots). Non-unique minima-selection can create kinks in the hyper-objective. Even when S(θ) is a C∞ connected manifold and varies smoothly with θ, minimiz… view at source ↗

**Figure 2.** Figure 2: Hyper-gradient error proof for Lemma 5.2. This figure summarizes the proof strategy presented in Appendix E.2: tube invertibility, deterministic stability, its on-tube specialization, and off-tube clipping control are combined to prove Lemma 5.2. Using α ≤ 1/LF gives 3 4α − LF 2 ≥ 1 4α , hence F(θt+1) ≤ F(θt) − 1 4α ∥∆t∥ 2 + α∥et∥ 2 = F(θt) − α 4 ∥Get∥ 2 + α∥et∥ 2 . Summing over t = 0, . . . , T − 1 and us… view at source ↗

**Figure 3.** Figure 3: Four-step selection-error proof for Lemma 5.3. Each lemma, proposition, corollary, or definition used in Appendix E.3 appears in its own box. The larger step boxes group the results according to the proof strategy stated at the beginning of Appendix E.3; dashed arrows mark supporting assumptions or intermediate deterministic superquantile steps. Notation for the selection proof. We fix notation for the sel… view at source ↗

**Figure 4.** Figure 4: Data hyper-cleaning (MNIST). Train (top row) and test (bottom row) accuracies vs. time at corruption rates ρ ∈ {0.4, 0.6, 0.8}. Accuracies are plotted as mean ± 95% confidence intervals (fixed budget per algorithm: 60 s for ρ ∈ {0.4, 0.6} with 5 runs, and 120 s for ρ = 0.8 with 5 runs). mean test accuracy from 64.5% (HG-BASELINE) to 66.1%, and outperforms the penalty/alternating baselines under the same ti… view at source ↗

**Figure 5.** Figure 5: Imbalanced loss tuning (MNIST). Train/test balanced accuracy vs. time (mean and 95% CI; 120s per algorithm). Relation to the manifold theory. This benchmark tests the select-then-differentiate principle in a finite-budget nonconvex training pipeline, where the inner problems are not solved exactly and different optimizer branches can reach candidate models with different validation losses and balanced accu… view at source ↗

read the original abstract

We study optimistic bilevel optimization when the lower-level problem has a non-isolated manifold of minimizers. In this setting, the hyper-objective may be non-differentiable because the upper-level criterion must choose among multiple lower-level solutions. Under a local Polyak--{\L}ojasiewicz (P{\L}) condition, we show that differentiability does not require the lower-level solution set to be a singleton: uniqueness of the optimistic selection is sufficient. This yields an explicit pseudoinverse-based hyper-gradient formula extending the classical singleton-minimizer result. We further characterize the regularity of the hyper-objective: non-degeneracy of the selected minimizer along the solution manifold yields local smoothness, while failure of uniqueness can create many non-differentiable points and failure of non-degeneracy can destroy all positive H\"older regularity of the hyper-gradient. Motivated by this theory, we propose HG-MS, a select-then-differentiate method combining explicit optimistic selection with efficient pseudoinverse-based hyper-gradient computation. Despite the nonconvex nature of optimistic selection over the lower-level solution manifold, we show that HG-MS converges to a stationary point of the optimistic objective with complexity governed by the intrinsic dimension of the solution manifold rather than its ambient dimension. Empirically, we test a practical variant of HG-MS for matched-budget LLM source reweighting. This variant preserves the select-then-differentiate principle and obtains the best GSM8K/MATH scores across the tested backbones, along with competitive or best MT-Bench instruction-following results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows how to differentiate through bilevel problems with manifold lower-level solutions by relying on unique optimistic selection rather than a singleton minimizer.

read the letter

This paper's key point is that in bilevel optimization, you don't need the lower-level problem to have a unique minimizer for the hyper-objective to be differentiable. Uniqueness of the optimistic selection among the manifold of solutions is enough, under a local Polyak-Łojasiewicz condition. They give an explicit hyper-gradient formula using the pseudoinverse that generalizes the usual case. They do a good job developing the theory around regularity too. Non-degeneracy along the manifold gives local smoothness, while degeneracy can kill Hölder continuity. That seems useful for understanding when things go wrong. The HG-MS algorithm follows from this: it does the selection first then differentiates, and they show convergence to a stationary point with complexity tied to the intrinsic dimension of the manifold. That's a solid algorithmic contribution because it avoids paying for the ambient dimension. The empirical test on matched-budget LLM source reweighting is a nice real-world check. Their practical variant gets strong results on GSM8K, MATH, and MT-Bench across backbones. It shows the idea can translate to practice. The soft spots are in the assumptions and the selection step. Local PL is common but not universal, and ensuring the optimistic selection is unique might require extra work or checks that aren't always easy. The selection itself is nonconvex, so the practical version probably uses some approximation or heuristic, which could affect guarantees. The paper doesn't seem to have extensive comparisons on standard bilevel benchmarks, so it's not clear how it stacks up against other methods in controlled settings. This is aimed at researchers in bilevel optimization and machine learning applications like hyperparameter optimization or data reweighting. Anyone dealing with lower-level problems that have multiple solutions will find the framework helpful. It has enough new theory and a motivated algorithm to deserve a serious referee. I would recommend sending it out for peer review. The central claims look consistent and the extension is meaningful.

Referee Report

0 major / 3 minor

Summary. The manuscript studies optimistic bilevel optimization when the lower-level problem has a non-isolated manifold of minimizers. Under a local Polyak-Łojasiewicz condition, it shows that differentiability of the hyper-objective requires only uniqueness of the optimistic selection (rather than a singleton lower-level solution set) and derives an explicit pseudoinverse-based hyper-gradient formula. It further characterizes regularity: non-degeneracy of the selected point yields local smoothness while degeneracy can destroy Hölder regularity. The paper proposes the HG-MS algorithm (select-then-differentiate with pseudoinverse gradients), proves convergence to a stationary point of the optimistic objective with complexity scaling in the intrinsic dimension of the manifold, and reports empirical results on a practical variant for matched-budget LLM source reweighting.

Significance. If the central claims hold, the work meaningfully extends differentiable bilevel optimization to manifold-valued lower-level solution sets, a setting that arises in many applications. The explicit pseudoinverse formula and the intrinsic-dimension complexity bound are concrete technical advances over classical singleton results. The empirical demonstration on LLM reweighting provides evidence of practical utility. The combination of local PL analysis, selection uniqueness, and manifold-aware convergence is a clear strength.

minor comments (3)

The statement of the local PL condition and the precise definition of the optimistic selection operator should be cross-referenced to the main theorem on differentiability (likely around the hyper-gradient formula) to make the logical dependence explicit.
In the empirical section on LLM source reweighting, the description of the practical HG-MS variant would benefit from a short paragraph clarifying how the manifold selection step is approximated under the matched-budget constraint, including any discretization or sampling used.
A few sentences comparing the derived pseudoinverse formula to existing implicit-differentiation approaches for non-unique lower-level problems would help situate the contribution for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. The report highlights the key contributions on pseudoinverse hyper-gradients under manifold lower-level solutions and the intrinsic-dimension complexity of HG-MS, which aligns with our intent. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives differentiability of the hyper-objective from the local PL condition plus uniqueness of the optimistic selection, yielding a pseudoinverse hyper-gradient that reduces to the classical singleton case by standard sensitivity analysis on the tangent space. Convergence of HG-MS follows directly from manifold geometry and intrinsic dimension after selection, with no reduction of any claim to a fitted parameter, self-definition, or load-bearing self-citation. All steps are self-contained against implicit-function and manifold sensitivity results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest primarily on the local PL condition (a standard but non-trivial domain assumption in nonconvex optimization) and the uniqueness of optimistic selection; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Local Polyak-Łojasiewicz (PL) condition on the lower-level problem
Invoked to guarantee that differentiability holds when the optimistic selection is unique.

pith-pipeline@v0.9.0 · 5593 in / 1355 out tokens · 80331 ms · 2026-05-12T02:59:55.444664+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Under a local Polyak–Łojasiewicz (PŁ) condition, we show that differentiability does not require the lower-level solution set to be a singleton: uniqueness of the optimistic selection is sufficient. This yields an explicit pseudoinverse-based hyper-gradient formula
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
S(θ) is a connected compact C² embedded submanifold of R^d without boundary

Reference graph

Works this paper leans on

231 extracted references · 231 canonical work pages · 4 internal anchors

[1]

Criscitiello, Chris and Rebjock, Quentin and Boumal, Nicolas , title =

work page
[2]

Foundations of Computational Mathematics , pages=

Analysis of langevin monte carlo from poincare to log-sobolev , author=. Foundations of Computational Mathematics , pages=. 2024 , publisher=

work page 2024
[3]

arXiv preprint arXiv:2409.10323 , year=

On the Hardness of Meaningful Local Guarantees in Nonsmooth Nonconvex Optimization , author=. arXiv preprint arXiv:2409.10323 , year=

work page arXiv
[4]

The Thirty Sixth Annual Conference on Learning Theory , pages=

Deterministic nonsmooth nonconvex optimization , author=. The Thirty Sixth Annual Conference on Learning Theory , pages=. 2023 , organization=

work page 2023
[5]

International Conference on Machine Learning , pages=

On the finite-time complexity and practical computation of approximate stationarity concepts of lipschitz functions , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[6]

Advances in Neural Information Processing Systems , volume=

Oracle complexity in nonsmooth nonconvex optimization , author=. Advances in Neural Information Processing Systems , volume=

work page
[7]

International Conference on Machine Learning , pages=

Complexity of finding stationary points of nonconvex nonsmooth functions , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[8]

arXiv preprint arXiv:1611.01540 , year=

Topology and geometry of half-rectified network optimization , author=. arXiv preprint arXiv:1611.01540 , year=

work page arXiv
[9]

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Empirical analysis of the hessian of over-parametrized neural networks , author=. arXiv preprint arXiv:1706.04454 , year=

work page Pith review arXiv
[10]

arXiv preprint arXiv:1802.06384 , year=

Spurious valleys in two-layer neural network optimization landscapes , author=. arXiv preprint arXiv:1802.06384 , year=

work page arXiv
[11]

International Conference on Machine Learning , pages=

Understanding the loss surface of neural networks for binary classification , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[12]

Advances in neural information processing systems , volume=

Loss surfaces, mode connectivity, and fast ensembling of dnns , author=. Advances in neural information processing systems , volume=

work page
[13]

International conference on machine learning , pages=

On connected sublevel sets in deep learning , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[14]

Advances in neural information processing systems , volume=

Explaining landscape connectivity of low-cost solutions for multilayer nets , author=. Advances in neural information processing systems , volume=

work page
[15]

arXiv preprint arXiv:2404.06391 , year=

Exploring Neural Network Landscapes: Star-Shaped and Geodesic Connectivity , author=. arXiv preprint arXiv:2404.06391 , year=

work page arXiv
[16]

Advances in neural information processing systems , volume=

On the local minima of the empirical risk , author=. Advances in neural information processing systems , volume=

work page
[17]

arXiv preprint arXiv:2501.00429 , year=

Poincare Inequality for Local Log-Polyak-Lojasiewicz Measures: Non-asymptotic Analysis in Low-temperature Regime , author=. arXiv preprint arXiv:2501.00429 , year=

work page arXiv
[18]

1997 , publisher=

Techniques in fractal geometry , author=. 1997 , publisher=

work page 1997
[19]

Mathematical programming , volume=

Smooth minimization of non-smooth functions , author=. Mathematical programming , volume=. 2005 , publisher=

work page 2005
[20]

SIAM Journal on Optimization , volume=

Smoothing and first order methods: A unified framework , author=. SIAM Journal on Optimization , volume=. 2012 , publisher=

work page 2012
[21]

Mathematical Programming , volume=

Cubic regularization of Newton method and its global performance , author=. Mathematical Programming , volume=. 2006 , publisher=

work page 2006
[22]

2019 , publisher=

High-dimensional statistics: A non-asymptotic viewpoint , author=. 2019 , publisher=

work page 2019
[23]

arXiv preprint arXiv:1501.01571 , year=

An introduction to matrix concentration inequalities , author=. arXiv preprint arXiv:1501.01571 , year=

work page arXiv
[24]

Statistics & probability letters , volume=

On the best constant in Marcinkiewicz--Zygmund inequality , author=. Statistics & probability letters , volume=. 2001 , publisher=

work page 2001
[25]

arXiv preprint arXiv:2107.11433 , year=

A general sample complexity analysis of vanilla policy gradient , author=. arXiv preprint arXiv:2107.11433 , year=

work page arXiv
[26]

2018 , publisher=

Reinforcement learning: An introduction , author=. 2018 , publisher=

work page 2018
[27]

SIAM Journal on Control and Optimization , volume=

Global convergence of policy gradient methods to (almost) locally optimal policies , author=. SIAM Journal on Control and Optimization , volume=. 2020 , publisher=

work page 2020
[28]

International conference on machine learning , pages=

Hessian aided policy gradient , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[29]

arXiv preprint arXiv:2012.01491 , year=

Sample complexity of policy gradient finding second-order stationary points , author=. arXiv preprint arXiv:2012.01491 , year=

work page arXiv 2012
[30]

Advances in Neural Information Processing Systems , volume=

An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods , author=. Advances in Neural Information Processing Systems , volume=

work page
[31]

arXiv preprint arXiv:2110.10116 , year=

On the Global Convergence of Momentum-based Policy Gradient , author=. arXiv preprint arXiv:2110.10116 , year=

work page arXiv
[32]

Journal of Machine Learning Research , volume=

On the theory of policy gradient methods: Optimality, approximation, and distribution shift , author=. Journal of Machine Learning Research , volume=. 2021 , publisher=

work page 2021
[33]

arXiv preprint arXiv:2106.04096 , year=

Linear convergence of entropy-regularized natural policy gradient with linear function approximation , author=. arXiv preprint arXiv:2106.04096 , year=

work page arXiv
[34]

Neural policy gradient methods: Global optimality and rates of convergence.arXiv preprint arXiv:1909.01150, 2019

Neural policy gradient methods: Global optimality and rates of convergence , author=. arXiv preprint arXiv:1909.01150 , year=

work page arXiv 1909
[35]

Approximately optimal approximate reinforcement learning , author=. In Proc. 19th International Conference on Machine Learning , year=

work page
[36]

Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki , volume=

Gradient methods for minimizing functionals , author=. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki , volume=. 1963 , publisher=

work page 1963
[37]

Linear convergence of gradient and proximal-gradient methods under the polyak-

Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

work page 2016
[38]

Advances in neural information processing systems , volume=

Convergence analysis of two-layer neural networks with relu activation , author=. Advances in neural information processing systems , volume=

work page
[39]

arXiv preprint arXiv:1609.05191 , year=

Gradient descent learns linear dynamical systems , author=. arXiv preprint arXiv:1609.05191 , year=

work page arXiv
[40]

IEEE Transactions on information Theory , volume=

Median-truncated nonconvex approach for phase retrieval with outliers , author=. IEEE Transactions on information Theory , volume=. 2018 , publisher=

work page 2018
[41]

Identity matters in deep learning

Identity matters in deep learning , author=. arXiv preprint arXiv:1611.04231 , year=

work page arXiv
[42]

Applied and computational harmonic analysis , volume=

Rapid, robust, and reliable blind deconvolution via nonconvex optimization , author=. Applied and computational harmonic analysis , volume=. 2019 , publisher=

work page 2019
[43]

Advances in Neural Information Processing Systems , volume=

Uniform convergence of gradients for non-convex learning and optimization , author=. Advances in Neural Information Processing Systems , volume=

work page
[44]

arXiv preprint arXiv:1612.00547 , year=

Gradient descent efficiently finds the cubic-regularized non-convex Newton step , author=. arXiv preprint arXiv:1612.00547 , year=

work page arXiv
[45]

Advances in neural information processing systems , volume=

Stochastic cubic regularization for fast nonconvex optimization , author=. Advances in neural information processing systems , volume=

work page
[46]

The Annals of Probability , volume=

Matrix concentration inequalities via the method of exchangeable pairs , author=. The Annals of Probability , volume=. 2014 , publisher=

work page 2014
[47]

Better Theory for

Better theory for SGD in the nonconvex world , author=. arXiv preprint arXiv:2002.03329 , year=

work page arXiv 2002
[48]

Neural computation , volume=

Fast exact multiplication by the Hessian , author=. Neural computation , volume=. 1994 , publisher=

work page 1994
[49]

, author=

Stochastic Variance-Reduced Cubic Regularization Methods. , author=. J. Mach. Learn. Res. , volume=

work page
[50]

The 22nd International Conference on Artificial Intelligence and Statistics , pages=

Stochastic variance-reduced cubic regularization for nonconvex optimization , author=. The 22nd International Conference on Artificial Intelligence and Statistics , pages=. 2019 , organization=

work page 2019
[51]

Conference on Learning Theory , pages=

Convergence rates and approximation results for SGD and its continuous-time counterpart , author=. Conference on Learning Theory , pages=. 2021 , organization=

work page 2021
[52]

Advances in Neural Information Processing Systems , volume=

Convergence of cubic regularization for nonconvex optimization under KL property , author=. Advances in Neural Information Processing Systems , volume=

work page
[53]

Advances in Neural Information Processing Systems , volume=

Tight dimension independent lower bound on the expected convergence rate for diminishing step sizes in SGD , author=. Advances in Neural Information Processing Systems , volume=

work page
[54]

arXiv preprint arXiv:1611.01146 , year=

Finding approximate local minima for nonconvex optimization in linear time , author=. arXiv preprint arXiv:1611.01146 , year=

work page arXiv
[55]

Mathematical Programming , volume=

A linear-time algorithm for trust region problems , author=. Mathematical Programming , volume=. 2016 , publisher=

work page 2016
[56]

arXiv preprint arXiv:1710.05782 , year=

Second-order methods with cubic regularization under inexact information , author=. arXiv preprint arXiv:1710.05782 , year=

work page arXiv
[57]

Part I: motivation, convergence and numerical results , author=

Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results , author=. Mathematical Programming , volume=. 2011 , publisher=

work page 2011
[58]

arXiv preprint arXiv:1710.04788 , year=

A unified scheme to accelerate adaptive cubic regularization and gradient methods for convex optimization , author=. arXiv preprint arXiv:1710.04788 , year=

work page arXiv
[59]

International Conference on Machine Learning , pages=

Sub-sampled cubic regularization for non-convex optimization , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017
[60]

Proceedings of the 2020 SIAM International Conference on Data Mining , pages=

Second-order optimization for non-convex machine learning: An empirical study , author=. Proceedings of the 2020 SIAM International Conference on Data Mining , pages=. 2020 , organization=

work page 2020
[61]

Advances in Neural Information Processing Systems , volume=

Variational policy gradient method for reinforcement learning with general utilities , author=. Advances in Neural Information Processing Systems , volume=

work page
[62]

Advances in Neural Information Processing Systems , volume=

On the convergence and sample efficiency of variance-reduced policy gradient method , author=. Advances in Neural Information Processing Systems , volume=

work page
[63]

International Conference on Machine Learning , pages=

Global convergence of policy gradient methods for the linear quadratic regulator , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[64]

International Conference on Machine Learning , pages=

On the global convergence rates of softmax policy gradient methods , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[65]

2003 , publisher=

Introductory lectures on convex optimization: A basic course , author=. 2003 , publisher=

work page 2003
[66]

International Conference on Machine Learning , pages=

An asynchronous parallel stochastic coordinate descent algorithm , author=. International Conference on Machine Learning , pages=. 2014 , organization=

work page 2014
[67]

Mathematical Programming , volume=

Linear convergence of first order methods for non-strongly convex optimization , author=. Mathematical Programming , volume=. 2019 , publisher=

work page 2019
[68]

Journal of the Operations Research Society of China , volume=

On the linear convergence of a proximal gradient method for a class of nonsmooth convex minimization problems , author=. Journal of the Operations Research Society of China , volume=. 2013 , publisher=

work page 2013
[69]

Foundations of Computational Mathematics , volume=

A geometric analysis of phase retrieval , author=. Foundations of Computational Mathematics , volume=. 2018 , publisher=

work page 2018
[70]

INFORMS Journal on Optimization , volume=

Adaptive stochastic variance reduction for subsampled Newton method with cubic regularization , author=. INFORMS Journal on Optimization , volume=. 2022 , publisher=

work page 2022
[71]

International Conference on Machine Learning , pages=

Stochastic variance-reduced cubic regularized Newton methods , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018
[72]

Conference on Learning Theory , pages=

Second-order information in non-convex stochastic optimization: Power and limitations , author=. Conference on Learning Theory , pages=. 2020 , organization=

work page 2020
[73]

arXiv preprint arXiv:1902.02388 , year=

Inexact proximal cubic regularized newton methods for convex optimization , author=. arXiv preprint arXiv:1902.02388 , year=

work page arXiv 1902
[74]

GitHub repository , howpublished =

Zuo, Xingdong , title =. GitHub repository , howpublished =. 2018 , publisher =

work page 2018
[75]

2012 IEEE/RSJ international conference on intelligent robots and systems , pages=

Mujoco: A physics engine for model-based control , author=. 2012 IEEE/RSJ international conference on intelligent robots and systems , pages=. 2012 , organization=

work page 2012
[76]

International Conference on Machine Learning , pages=

Momentum-based policy gradient methods , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[77]

Advances in neural information processing systems , volume=

Momentum-Based Variance Reduction in Non-Convex SGD , author=. Advances in neural information processing systems , volume=

work page
[78]

International Conference on Artificial Intelligence and Statistics , pages=

Stochastic recursive variance-reduced cubic regularization methods , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=

work page 2020
[79]

arXiv e-print , author=

The Masked Sample Covariance Estimator: An Analysis via Matrix Concentration Inequalities. arXiv e-print , author=. arXiv preprint arXiv:1109.1637 , year=

work page arXiv
[80]

Machine learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine learning , volume=. 1992 , publisher=

work page 1992

Showing first 80 references.