pith. machine review for the scientific record.

arxiv: 2605.06431 · v1 · submitted 2026-05-07 · 🧮 math.OC

Recognition: unknown

Second-Order Bilevel Optimization with Accelerated Convergence Rates

Chengchang Liu, John C.S. Lui, Lesi Chen, Sheng Yang


Pith reviewed 2026-05-08 08:05 UTC · model grok-4.3

classification 🧮 math.OC
keywords bilevel optimization · second-order methods · accelerated convergence · nonconvex-strongly-convex · hyper-objective · lazy approximation · minimax optimization

The pith

A fully second-order method for nonconvex-strongly-convex bilevel problems reaches an (ε, O(√ε)) second-order stationary point of the hyper-objective in Õ(ε^{-1.5}) iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FSBA, a fully second-order bilevel approximation algorithm that uses second-order oracles from both levels to drive updates on the hyper-objective. It establishes that this method attains the stated iteration complexity bound, improving on known first-order rates for the same stationarity measure. A lazy variant, LFSBA, reuses second-order information across multiple steps to cut computational cost by a factor of √d. The same reuse idea yields an analogous lazy minimax cubic-regularized Newton method (LMCN) for nonconvex-strongly-concave minimax problems.
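To make the reuse pattern concrete, here is a minimal sketch of the "lazy" second-order idea in isolation: a Newton-type loop that refreshes its Hessian only every m iterations and reuses the cached copy in between. This illustrates the caching pattern only, not the paper's LFSBA; `grad`, `hess`, and the damping constant are assumptions of the sketch.

```python
import numpy as np

def lazy_newton_sketch(grad, hess, x0, m=5, steps=50, damping=1e-3):
    """Illustrative 'lazy' reuse pattern: the expensive second-order
    oracle is called only every m iterations and the cached Hessian is
    reused in between. Not the paper's LFSBA, just the caching idea."""
    x = np.asarray(x0, dtype=float)
    H = None
    for t in range(steps):
        if t % m == 0:
            H = hess(x)  # refresh the expensive second-order oracle every m steps
        # damped Newton step using the (possibly stale) cached Hessian
        x = x - np.linalg.solve(H + damping * np.eye(x.size), grad(x))
    return x
```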

Core claim

FSBA achieves an iteration complexity of Õ(ε^{-1.5}) for finding an (ε, O(√ε)) second-order stationary point of the hyper-objective in nonconvex-strongly-convex bilevel optimization, demonstrating that second-order methods can achieve accelerated convergence rates relative to first-order methods.
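For orientation, an $(\epsilon, \mathcal{O}(\sqrt{\epsilon}))$ second-order stationary point of the hyper-objective $\varphi$ is standardly a point $x$ with $\|\nabla \varphi(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 \varphi(x)) \ge -\mathcal{O}(\sqrt{\epsilon})$; the $\sqrt{\epsilon}$ curvature tolerance is the one that pairs naturally with cubic regularization (the paper's exact constants may differ). A first-order stationary point controls only the gradient norm.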

What carries the argument

The FSBA update, which approximates the hyper-objective gradient and Hessian via full second-order oracles at both levels and solves a cubic-regularized subproblem at each step.
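A minimal sketch of the cubic-regularized step this describes: minimize the Nesterov–Polyak cubic model with a generic numerical solver. FSBA's actual subproblem solver and its inexactness handling are not specified here; `g`, `H`, and `M` stand in for the approximate hypergradient, approximate Hessian, and regularization parameter.

```python
import numpy as np
from scipy.optimize import minimize

def cubic_step(g, H, M):
    """One cubic-regularized Newton step: minimize the cubic model
        m(s) = <g, s> + 0.5 <s, H s> + (M/6) ||s||^3
    over s. A generic quasi-Newton solve is used for clarity; this is a
    sketch of the subproblem, not FSBA's dedicated solver."""
    def model(s):
        return g @ s + 0.5 * s @ (H @ s) + (M / 6.0) * np.linalg.norm(s) ** 3
    # warm-start from a scaled negative-gradient direction
    s0 = -g / (np.linalg.norm(g) + 1.0)
    return minimize(model, s0, method="L-BFGS-B").x
```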

Load-bearing premise

The bilevel problem has a nonconvex upper level and a strongly convex lower level, together with enough smoothness to support second-order oracles whose cost is counted in the complexity.
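Spelled out, this is the standard problem class $\min_x \varphi(x) := f(x, y^*(x))$ subject to $y^*(x) = \arg\min_y g(x, y)$, with $f$ possibly nonconvex in $x$ and $g(x, \cdot)$ $\mu$-strongly convex, so that $y^*(x)$ is unique and the hyper-objective $\varphi$ is well defined.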

What would settle it

Running FSBA on a concrete nonconvex-strongly-convex bilevel test problem with successively smaller ε and observing that the required number of iterations grows faster than ε^{-1.5}, beyond what logarithmic factors allow, would disprove the claimed rate.
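A minimal sketch of that falsification test, assuming a hypothetical `run_fsba(eps)` that returns the iteration count needed to reach an ε-accurate second-order stationary point on the fixed test problem: fit the slope of log T against log(1/ε), and treat a slope persistently above 1.5 as evidence against the rate.

```python
import numpy as np

def empirical_rate(run_solver, eps_grid):
    """Fit T(eps) ~ eps^{-slope} on a log-log scale. `run_solver` is a
    hypothetical callable: iterations to reach an eps-second-order
    stationary point on a fixed nonconvex-strongly-convex test problem."""
    iters = np.array([run_solver(eps) for eps in eps_grid], dtype=float)
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(eps_grid)), np.log(iters), 1)
    return slope  # consistent with the claim if <= 1.5 up to log factors

# e.g.: empirical_rate(run_fsba, [1e-2, 1e-3, 1e-4, 1e-5])
```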

Figures

Figures reproduced from arXiv: 2605.06431 by Chengchang Liu, John C.S. Lui, Lesi Chen, Sheng Yang.

Figure 1
Figure 1. Comparison of LMCN and baseline algorithms in terms of oracle calls under different initial points (x1, y1) and (x2, y2). Panels: Breast p = 0.25, Breast p = 0.5, Australian p = 0.25, Australian p = 0.5.
Figure 2
Figure 2. Comparison of various bilevel algorithms at different noise rates p on the "breast-cancer" and "australian" datasets (Chang & Lin, 2011): ITD (Ji et al., 2021), AID with conjugate gradient (Maclaurin et al., 2015), and the near-optimal fully first-order method F2BA (Chen et al., 2023). Results are reported on Dtr with noise rates p = 25% and p = 50%.
Figure 3
Figure 3. Comparison of various bilevel algorithms on logistic regression on the 20 Newsgroups dataset.
Figure 4
Figure 4. Test accuracy versus running time of different bilevel algorithms in 5-way 5-shot few-shot meta-learning experiments on FC100 and miniImageNet. Panels: (a) FC100, (b) miniImageNet. Here $L_{D_i}(x, y_i) = \frac{1}{|D_i|}\sum_{\xi \in D_i} \ell(x, y_i; \xi)$ is the query loss and $L_{S_i}(x, y_i) = \frac{1}{|S_i|}\sum_{\xi \in S_i} \left(\ell(x, y_i; \xi) + R(y_i)\right)$ is the support loss, where $\ell$ is the cross-entropy loss and $R$ is an $\ell_2$ regularizer.
Figure 5
Figure 5. Comparison of various bilevel algorithms for data hypercleaning at different noise rates p on the MNIST dataset.
Figure 6
Figure 6. Ablation study on the Hessian update frequency m, comparing the IFSBA method (Algorithm 3) with baseline methods, including ITD, AID with conjugate gradient, and F2BA, on the MNIST dataset (LeCun et al., 2002), with noise rates p = 25% and p = 50%.
Figure 7
Figure 7. Ablation study on the penalty multiplier λ for the data-cleaning task. Panels: (a) M = 1, (b) M = 5, (c) M = 10.
Figure 8
Figure 8. Ablation study on the cubic regularization parameter M for the data-cleaning task.
read the original abstract

This paper studies second-order methods for nonconvex-strongly-convex bilevel optimization. We propose a novel fully second-order bilevel approximation method (FSBA) that achieves an iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ for finding the $(\epsilon, \mathcal{O}(\sqrt{\epsilon}))$ second-order stationary point of the hyper-objective function. Our results demonstrate that second-order methods can achieve an accelerated convergence rate than first-order methods in bilevel optimization. To address the heavy computational cost associated with the second-order oracle, we introduce a lazy variant of FSBA, called LFSBA, which reuses second-order information across several iterations. We prove that LFSBA exhibits better computational complexity than FSBA by a factor of $\sqrt{d}$, where $d$ is the dimension of the problem. We also apply a similar idea to nonconvex strongly-concave minimax optimization and propose the lazy minimax cubic-regularized Newton (LMCN) method with better computational complexity compared to existing second-order methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The central claims follow from a new algorithmic construction (FSBA) that applies cubic-regularized Newton to the hyper-objective under the stated nonconvex-strongly-convex bilevel assumptions. Sections 3–4 derive the Õ(ε^{-1.5}) iteration complexity by bounding the errors introduced by implicit differentiation and inexact inner solves; these bounds are obtained from standard second-order analysis plus explicit per-iteration oracle accounting rather than by fitting parameters to the target quantity or by reducing to a self-citation whose content is itself the result. The lazy variant LFSBA and the minimax extension are likewise obtained by reusing second-order information with a separate complexity argument. No load-bearing step equates the claimed rate to an input by definition or by prior self-citation.
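For context, the implicit-differentiation step rests on the standard hypergradient identity $\nabla \varphi(x) = \nabla_x f(x, y^*(x)) - \nabla^2_{xy} g(x, y^*(x)) \left[\nabla^2_{yy} g(x, y^*(x))\right]^{-1} \nabla_y f(x, y^*(x))$, which is well defined because lower-level strong convexity makes $\nabla^2_{yy} g$ invertible; the analysis must bound the error incurred when $y^*(x)$ is replaced by an inexact inner solve.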

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard domain assumptions for bilevel optimization; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Nonconvex-strongly-convex bilevel problem with sufficient smoothness for second-order oracles
    Invoked to obtain the stated iteration and computational complexities.

pith-pipeline@v0.9.0 · 5482 in / 1122 out tokens · 42446 ms · 2026-05-08T08:05:49.170644+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 16 canonical work pages

  1. [1] Chen, L., Ma, Y., and Zhang, J. Near-optimal fully first-order algorithms for finding stationary points in bilevel optimization. arXiv preprint arXiv:2306.14853, 2023.

  2. [2] Chen, L., Liu, C., and Zhang, J. Second-order min-max optimization with lazy Hessians. arXiv preprint arXiv:2410.09568, 2024a.

  3. [3] Chu, T., Xu, D., Yao, W., Yu, C., and Zhang, J. A provably convergent plug-and-play framework for stochastic bilevel optimization. arXiv preprint arXiv:2505.01258, 2025.

  4. [4] Doikov, N. and Grapiglia, G. N. First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians. arXiv preprint arXiv:2309.02412, 2023.

  5. [5] Dong, Y., Yang, J., Yao, W., and Zhang, J. Efficient curvature-aware hypergradient approximation for bilevel optimization. arXiv preprint arXiv:2505.02101, 2025.

  6. [6] Ghadimi, S. and Wang, M. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

  7. [7] Huang, F. Optimal Hessian/Jacobian-free nonconvex-PL bilevel optimization. arXiv preprint arXiv:2407.17823, 2024.

  8. [8] Huo, H., Liu, R., and Su, Z. A new simple stochastic gradient descent type algorithm with lower computational complexity for bilevel optimization. arXiv preprint arXiv:2306.11211, 2023.

  9. [9] Kornowski, G. and Shamir, O. High-order oracle complexity of smooth and strongly convex optimization. arXiv preprint arXiv:2010.06642, 2020.

  10. [10] Kwon, J., Kwon, D., Wright, S., and Nowak, R. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. arXiv preprint arXiv:2309.01753, 2023a.

  11. [11] Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.

  12. [12] Robey, A., Latorre, F., Pappas, G. J., Hassani, H., and Cevher, V. Adversarial training should be cast as a non-zero-sum game. arXiv preprint arXiv:2306.11035, 2023.

  13. [13] Sow, D., Ji, K., Guan, Z., and Liang, Y. A primal-dual approach to bilevel optimization with multiple inner minima. arXiv preprint arXiv:2203.01123, 2022a.

  14. [14] Wang, N., Zhang, J., and Zhang, S. Efficient first-order method for saddle point problems with higher order smoothness. SIAM Journal on Optimization, 34(4):3342–3370, 2024a.

  15. [15] Xiao, Q., Lu, S., and Chen, T. An alternating optimization method for bilevel problems under the Polyak-Łojasiewicz condition. Advances in Neural Information Processing Systems, 36:63847–63873, 2023a.

  16. [16] Yao, W., Yu, C., Zeng, S., and Zhang, J. Constrained bi-level optimization: Proximal Lagrangian value function approach and Hessian-free algorithm. arXiv preprint arXiv:2401.16164, 2024.
