pith. machine review for the scientific record.

arxiv: 2605.10288 · v2 · submitted 2026-05-11 · 💻 cs.LG · math.OC

Recognition: 2 theorem links

· Lean Theorem

BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 06:16 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords stochastic bilevel optimization · memory-efficient optimization · single-loop bilevel · randomized subspaces · hyperparameter optimization · convergence analysis · data reweighting
0 comments

The pith

BROS matches the O(ε^{-2}) sample complexity of exact single-loop bilevel methods while restricting updates to randomized subspaces to lower memory use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BROS for stochastic bilevel optimization, a setting used in hyperparameter tuning and data reweighting. It runs lower-level and auxiliary updates inside randomly chosen subspaces rather than the full parameter space and applies a Rademacher bi-probe correction to keep the required Hessian-action estimator unbiased. Under standard assumptions, this construction is shown to retain the same convergence rate order as full-space single-loop algorithms such as MA-SOBA. Experiments on hyper-data cleaning, data-mixture learning, representation learning, and ViT reweighting confirm that peak memory drops by as much as 44.9 percent while solution quality stays comparable to the unrestricted baselines.

Core claim

BROS performs both the lower-level and auxiliary updates inside randomized subspaces and employs a Rademacher bi-probe correction that restores an unbiased estimator of the Hessian-action vector. The resulting single-loop algorithm is proved to locate an ε-stationary point with O(ε^{-2}) sample complexity under only the standard assumptions of stochastic bilevel optimization, matching the rate of exact full-space methods while using far less memory.
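To make the pattern concrete, here is a minimal sketch of a SOBA-style single-loop iteration in which the lower-level and auxiliary steps are restricted to a freshly drawn random subspace. It is an illustration under assumed choices, not the authors' implementation: the toy quadratic problem, the step sizes, the subspace dimension k, and the (d/k) rescaling that stands in for the Rademacher bi-probe correction are all ours.

```python
# Sketch of a SOBA-style single-loop bilevel iteration in which the lower-level
# variable y and the auxiliary variable z are updated only along a freshly drawn
# random subspace.  Illustrative only: the toy problem, step sizes, and the
# (d/k) rescaling (a placeholder for the paper's bi-probe correction) are assumed.
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 10                          # ambient and subspace dimensions (assumed)
alpha, beta, gamma = 0.05, 0.1, 0.1    # step sizes (assumed)

# Toy quadratic bilevel problem:
#   lower level  g(x, y) = 0.5 * y^T A y - y^T B x
#   upper level  f(x, y) = 0.5 * ||y - c||^2
A = 2.0 * np.eye(d)
B = 0.1 * rng.standard_normal((d, d))
c = rng.standard_normal(d)

x, y, z = rng.standard_normal(d), np.zeros(d), np.zeros(d)

for _ in range(500):
    # Fresh random subspace: U has k orthonormal columns, P = U @ U.T.
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))

    grad_y_g = A @ y - B @ x           # grad_y g(x, y)
    grad_y_f = y - c                   # grad_y f(x, y)
    hvp_yy = A @ z                     # Hessian action: grad^2_yy g(x, y) @ z
    jvp_xy = -B.T @ z                  # cross term:     grad^2_xy g(x, y) @ z

    # Lower-level and auxiliary steps restricted to the subspace.
    y -= beta * (d / k) * (U @ (U.T @ grad_y_g))
    z -= gamma * (d / k) * (U @ (U.T @ (hvp_yy - grad_y_f)))

    # Upper-level step along the estimated hypergradient grad_x f - grad^2_xy g @ z.
    x -= alpha * (0.0 - jvp_xy)        # grad_x f = 0 for this toy f
```

The intended memory saving comes from forming Hessian actions and optimizer state only along k directions per iteration; the correction exists because, as the appendix fragments quoted at the bottom of this page indicate, averaging doubly projected Hessian actions over random subspaces is not simply a rescaling of the full-space quantity.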

What carries the argument

Randomized subspace updates combined with a Rademacher bi-probe correction that restores an unbiased Hessian-action estimator.

If this is right

  • Large lower-level neural networks become feasible inside single-loop bilevel training without exhausting GPU memory.
  • Hyperparameter optimization, data reweighting, and representation learning can run at the same statistical efficiency as full-space methods.
  • The same subspace-plus-correction pattern can be reused in other memory-intensive bilevel tasks such as neural architecture search.
  • Peak memory reductions of roughly 45 percent are attainable while preserving practical performance on ViT-scale models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to non-convex lower-level problems beyond the paper's tested regimes if the unbiasedness property holds under weaker smoothness conditions.
  • Similar randomized-subspace corrections could be tested in single-loop methods for other nested optimization problems such as adversarial training or meta-learning.
  • Empirical validation on even larger models (e.g., billion-parameter backbones) would quantify how the memory saving scales with network width.

Load-bearing premise

The Rademacher bi-probe correction still produces an unbiased Hessian-action estimator when all updates are restricted to randomly chosen subspaces.
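This premise is load-bearing precisely because the naive alternative fails: the appendix fragment quoted as internal anchor [72] in the reference graph below gives the moment identity for a doubly projected matrix, and it contains transpose and trace terms beyond a simple rescaling. The Monte Carlo sketch below assumes only that quoted formula; the matrix A, the dimensions m and r, and the trial count are arbitrary stand-ins, not quantities from the paper.

```python
# Monte Carlo check of the subspace-averaging identity quoted in anchor [72]:
#   E[Q A Q] = c1*A + c2*A^T + c2*tr(A)*I,
# with Q an orthogonal projection onto a uniformly random r-dim subspace of R^m.
# The extra A^T and tr(A)*I terms are why an uncorrected, merely rescaled
# projected Hessian action is a biased estimate of the full-space one.
import numpy as np

rng = np.random.default_rng(1)
m, r, trials = 8, 3, 100_000

A = rng.standard_normal((m, m))        # arbitrary test matrix (Hessian stand-in)
acc = np.zeros((m, m))
for _ in range(trials):
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))   # uniformly random r-dim subspace
    Q = U @ U.T                                        # orthogonal projection onto it
    acc += Q @ A @ Q
emp = acc / trials                                     # Monte Carlo estimate of E[Q A Q]

c2 = r * (m - r) / (m * (m - 1) * (m + 2))
c1 = r * (r * (m + 1) - 2) / (m * (m - 1) * (m + 2))
pred = c1 * A + c2 * A.T + c2 * np.trace(A) * np.eye(m)

print(np.abs(emp - pred).max())               # small: the quoted identity holds
print(np.abs(emp - (r / m) ** 2 * A).max())   # larger: E[Q A Q] is not just a rescaled A
```

The same calculation explains the quadratic counterexample quoted as anchor [73]: without a correction, the projected auxiliary update targets a systematically shifted linear system and can converge to a point that is not stationary for the true bilevel objective.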

What would settle it

Demonstrating that BROS requires asymptotically more than O(ε^{-2}) samples to reach an ε-stationary point on a standard stochastic bilevel benchmark under the usual smoothness and variance assumptions would falsify the claim.

read the original abstract

Stochastic bilevel optimization (SBO) has become a standard framework for hyperparameter learning, data reweighting, representation learning, and data-mixture optimization in deep learning. Existing exact single-loop SBO methods and memory-efficient surrogate SBO methods either create severe memory pressure for large lower-level neural networks or lack competitive convergence guarantees under standard assumptions. In this paper, we propose BROS, a memory-efficient single-loop SBO method with the same convergence rate order as exact single-loop SBO methods. BROS performs lower and auxiliary updates in randomized subspaces with a Rademacher bi-probe correction that recovers an unbiased Hessian-action estimator. We prove that BROS preserves the $\mathcal O(\varepsilon^{-2})$ sample complexity of MA-SOBA for finding an $\varepsilon$-stationary point under only standard assumptions. Experiments on hyper-data cleaning, data-mixture learning, hyper-representation learning, and ViT sample reweighting show that BROS reduces peak memory by up to 44.9% while closely matching full-space baseline performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes BROS, a memory-efficient single-loop method for stochastic bilevel optimization. It performs lower-level and auxiliary updates in randomized subspaces and applies a Rademacher bi-probe correction to recover an unbiased estimator of the Hessian-vector product. The central theoretical claim is that BROS achieves the same O(ε^{-2}) sample complexity as MA-SOBA for finding an ε-stationary point under only standard SBO assumptions (Lipschitz gradients, bounded variance). Experiments on hyper-data cleaning, data-mixture learning, hyper-representation learning, and ViT reweighting report up to 44.9% peak-memory reduction while matching full-space baseline performance.

Significance. If the unbiasedness of the bi-probe correction and the resulting sample-complexity bound hold, the result would be significant for scaling exact single-loop SBO methods to large neural networks where full-space Hessian storage is prohibitive. The approach combines subspace randomization with a simple correction that avoids extra memory, which is a practical advance over existing memory-efficient surrogates that sacrifice rates.

major comments (2)
  1. [§4] §4 (Theorem on unbiasedness): The proof that the Rademacher bi-probe recovers an unbiased estimator of the lower-level Hessian-vector product when both the update direction and probe vectors lie in the same random subspace must explicitly show that the projection operator commutes with the expectation over stochastic samples and Rademacher signs. The current argument appears to treat the subspace sampling as independent of the stochastic gradient samples without additional variance bounds; if this step only holds in expectation but not almost surely, the O(ε^{-2}) reduction to MA-SOBA does not follow directly.
  2. [§5] §5 (sample-complexity derivation): The claim that BROS inherits MA-SOBA’s exact O(ε^{-2}) rate assumes the bi-probe estimator has identical variance and bias properties to the full-space estimator. The analysis should quantify any multiplicative increase in variance induced by the subspace projection (e.g., via the subspace dimension parameter) and show that it does not alter the leading term in the sample complexity; otherwise the rate guarantee is only asymptotic in the subspace dimension.
minor comments (2)
  1. [Experiments] The experimental section should report the subspace dimension as a fraction of the full parameter space and include an ablation on how performance and memory scale with this fraction.
  2. [§3] Notation for the bi-probe correction (Eq. for the corrected Hessian action) should be introduced earlier and used consistently in both the algorithm box and the convergence theorem statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the unbiasedness proof and sample-complexity analysis. We address each major comment below and have revised the manuscript to strengthen the relevant sections.

read point-by-point responses
  1. Referee: [§4] §4 (Theorem on unbiasedness): The proof that the Rademacher bi-probe recovers an unbiased estimator of the lower-level Hessian-vector product when both the update direction and probe vectors lie in the same random subspace must explicitly show that the projection operator commutes with the expectation over stochastic samples and Rademacher signs. The current argument appears to treat the subspace sampling as independent of the stochastic gradient samples without additional variance bounds; if this step only holds in expectation but not almost surely, the O(ε^{-2}) reduction to MA-SOBA does not follow directly.

    Authors: We thank the referee for identifying the need for greater explicitness. The random subspace is sampled independently of the stochastic gradients and Rademacher signs. In the revised manuscript we have rewritten the proof of Theorem 4.1 to show step-by-step that, conditional on any fixed subspace realization, the projection operator commutes with the joint expectation over stochastic samples and Rademacher signs: E[P (∇²ℓ v)] = P E[∇²ℓ v]. We further derive a finite variance bound for the projected bi-probe estimator under the paper’s standard Lipschitz-gradient and bounded-variance assumptions; this bound is independent of ε and therefore preserves the direct reduction to the MA-SOBA O(ε^{-2}) rate. revision: yes

  2. Referee: [§5] §5 (sample-complexity derivation): The claim that BROS inherits MA-SOBA’s exact O(ε^{-2}) rate assumes the bi-probe estimator has identical variance and bias properties to the full-space estimator. The analysis should quantify any multiplicative increase in variance induced by the subspace projection (e.g., via the subspace dimension parameter) and show that it does not alter the leading term in the sample complexity; otherwise the rate guarantee is only asymptotic in the subspace dimension.

    Authors: We agree that an explicit variance factor is required. The revised §5 now contains a new supporting lemma (Lemma 5.2) that bounds the variance of the subspace bi-probe estimator by a multiplicative constant C(k/d) ≤ 2, where k is the subspace dimension and d the ambient dimension; this constant is independent of ε. The proof of Theorem 5.1 has been updated to absorb C into the leading coefficient, so the sample-complexity bound remains exactly O(ε^{-2}) with only a larger (but ε-independent) prefactor. revision: yes
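For readers who want the shape of the resulting rate argument, the two revisions combine in the usual single-loop fashion. The display below is a schematic with generic constants $c_1, c_2$ (our notation, not the paper's), assuming unbiasedness and Lemma 5.2's $\varepsilon$-independent factor $C(k/d) \le 2$ exactly as stated in the rebuttal:

\[
\frac{1}{K}\sum_{t=0}^{K-1}\mathbb{E}\bigl\|\nabla\Phi(x_t)\bigr\|^2
\;\le\; \frac{c_1\bigl(\Phi(x_0)-\inf\Phi\bigr) + c_2\,C(k/d)\,\sigma^2}{\sqrt{K}} .
\]

Driving the left-hand side below $\varepsilon$ then takes $K = \mathcal{O}(\varepsilon^{-2})$ iterations, and with O(1) oracle calls per iteration the sample complexity has the same order: the subspace inflates only the prefactor, never the rate. This is the absorption step the revised Theorem 5.1 is described as carrying out.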

Circularity Check

0 steps flagged

Convergence anchored to MA-SOBA rate; no load-bearing self-definition or fitted-input reduction

full rationale

The central claim is that BROS preserves the O(ε^{-2}) sample complexity of MA-SOBA for ε-stationary points under standard SBO assumptions, with the Rademacher bi-probe correction asserted to recover an unbiased Hessian-action estimator in randomized subspaces. No equation or theorem in the provided chain redefines a quantity in terms of itself (e.g., no fitted parameter renamed as prediction, no ansatz smuggled via self-citation, and no uniqueness theorem imported from overlapping authors). The derivation extends an external baseline rate rather than reducing to its own inputs by construction. A minor self-citation to MA-SOBA exists but is not load-bearing for the unbiasedness step itself, which is presented as holding under the paper's stated assumptions. This yields a low circularity score consistent with honest non-findings for papers whose core guarantee rests on independent prior analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard stochastic bilevel optimization assumptions and the correctness of the new randomized-subspace estimator; no free parameters or invented entities are explicitly listed in the abstract.

axioms (1)
  • domain assumption: Standard assumptions for stochastic bilevel optimization
    The proof of O(ε^{-2}) sample complexity is stated to hold under only these assumptions.
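In this literature, "standard assumptions" usually means some version of the following; the exact constants and norms are the paper's (partly quoted in internal anchor [71] of the reference graph below), so read this as a generic sketch rather than the paper's precise statement.

  • The upper-level objective $f(x, y)$ has Lipschitz-continuous gradients (and, where the analysis needs it, Lipschitz second derivatives) in both arguments.
  • For every fixed $x$, the lower-level objective $g(x, \cdot)$ is $\mu_g$-strongly convex, with Lipschitz $\nabla_y g$, $\nabla^2_{yy} g$, and $\nabla^2_{xy} g$; the condition number is $\kappa = L_{\max}/\mu_g$.
  • Stochastic oracles for these gradients and Hessian- and Jacobian-vector products are unbiased with variance bounded by $\sigma^2$.
  • For BROS specifically, the random subspace drawn at each iteration is independent of that iteration's stochastic samples (the filtration condition in anchor [71]).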

pith-pipeline@v0.9.0 · 5494 in / 1219 out tokens · 43445 ms · 2026-05-13T06:16:05.989686+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

  1. [1]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Ainslie, J., Lee-Thorp, J., De Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901, 2023

  2. [2]

    Amortized implicit differentiation for stochastic bilevel optimization

    Arbel, M. and Mairal, J. Amortized implicit differentiation for stochastic bilevel optimization.arXiv preprint arXiv:2111.14580, 2021

  3. [3]

    Meta-learning with differentiable closed-form solvers

    Bertinetto, L., Henriques, J. F., Torr, P. H., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018

  4. [4]

    Gpt-neox-20b: An open-source autoregressive language model

    Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. GPT-NeoX-20B: An open-source autoregressive language model.arXiv preprint arXiv:2204.06745, 2022

  5. [5]

    Tighter analysis of alternating stochastic gradient method for stochastic nested problems.arXiv preprint arXiv:2106.13781, 2021

    Chen, T., Sun, Y., and Yin, W. Tighter analysis of alternating stochastic gradient method for stochastic nested problems.arXiv preprint arXiv:2106.13781, 2021

  6. [6]

    Decentralized stochastic bilevel optimization with improved per-iteration complexity

    Chen, X., Huang, M., Ma, S., and Balasubramanian, K. Decentralized stochastic bilevel optimization with improved per-iteration complexity. InInternational Conference on Machine Learning, pp. 4641–4671. PMLR, 2023

  7. [7]

    Optimal algorithms for stochastic bilevel optimization under relaxed smoothness conditions.Journal of Machine Learning Research, 25(151):1–51, 2024

    Chen, X., Xiao, T., and Balasubramanian, K. Optimal algorithms for stochastic bilevel optimization under relaxed smoothness conditions.Journal of Machine Learning Research, 25(151):1–51, 2024

  8. [8]

    A memory efficient randomized subspace optimization method for training large language models.arXiv preprint arXiv:2502.07222, 2025

    Chen, Y., Zhang, Y., Liu, Y., Yuan, K., and Wen, Z. A memory efficient randomized subspace optimization method for training large language models.arXiv preprint arXiv:2502.07222, 2025

  9. [9]

    SPABA: A single-loop and probabilistic stochastic bilevel algorithm achieving optimal sample complexity.arXiv preprint arXiv:2405.18777, 2024

    Chu, T., Xu, D., Yao, W., and Zhang, J. SPABA: A single-loop and probabilistic stochastic bilevel algorithm achieving optimal sample complexity.arXiv preprint arXiv:2405.18777, 2024

  10. [10]

    A framework for bilevel optimization that enables stochastic and global variance reduction algorithms.Advances in Neural Information Processing Systems, 35:26698–26710, 2022

    Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms.Advances in Neural Information Processing Systems, 35:26698–26710, 2022

  11. [11]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  12. [12]

    Doge: Domain reweighting with generalization estimation.arXiv preprint arXiv:2310.15393, 2023

    Fan, S., Pagliardini, M., and Jaggi, M. Doge: Domain reweighting with generalization estimation.arXiv preprint arXiv:2310.15393, 2023

  13. [13]

    Bilevel programming for hyperparameter optimization and meta-learning

    Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. InInternational Conference on Machine Learning, pp. 1568–1577. PMLR, 2018

  14. [14]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

  15. [15]

    Approximation Methods for Bilevel Programming

    Ghadimi, S. and Wang, M. Approximation methods for bilevel programming.arXiv preprint arXiv:1802.02246, 2018

  16. [16]

    A single timescale stochastic approximation method for nested stochastic optimization.SIAM Journal on Optimization, 30(1):960–979, 2020

    Ghadimi, S., Ruszczynski, A., and Wang, M. A single timescale stochastic approximation method for nested stochastic optimization.SIAM Journal on Optimization, 30(1):960–979, 2020

  17. [17]

    RSN: randomized subspace newton.Advances in Neural Information Processing Systems, 32, 2019

    Gower, R., Kovalev, D., Lieder, F., and Richtárik, P. RSN: randomized subspace newton.Advances in Neural Information Processing Systems, 32, 2019

  18. [18]

    A novel convergence analysis for algorithms of the adam family

    Guo, Z., Xu, Y., Yin, W., Jin, R., and Yang, T. A novel convergence analysis for algorithms of the adam family. arXiv preprint arXiv:2112.03459, 2021

  19. [19]

    SLTrain: a sparse plus low rank approach for parameter and memory efficient pretraining.Advances in Neural Information Processing Systems, 37:118267–118295, 2024

    Han, A., Li, J., Huang, W., Hong, M., Takeda, A., Jawanpuria, P., and Mishra, B. SLTrain: a sparse plus low rank approach for parameter and memory efficient pretraining.Advances in Neural Information Processing Systems, 37:118267–118295, 2024

  20. [20]

    Bilevel optimization under unbounded smoothness: A new algorithm and convergence analysis.arXiv preprint arXiv:2401.09587, 2024

    Hao, J., Gong, X., and Liu, M. Bilevel optimization under unbounded smoothness: A new algorithm and convergence analysis.arXiv preprint arXiv:2401.09587, 2024

  21. [21]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016

  22. [22]

    Distributed bilevel optimization with communication compression.arXiv preprint arXiv:2405.18858, 2024

    He, Y., Hu, J., Huang, X., Lu, S., Wang, B., and Yuan, K. Distributed bilevel optimization with communication compression.arXiv preprint arXiv:2405.18858, 2024

  23. [23]

    Subspace optimization for large language models with convergence guarantees.arXiv preprint arXiv:2410.11289, 2024

    He, Y., Li, P., Hu, Y., Chen, C., and Yuan, K. Subspace optimization for large language models with convergence guarantees.arXiv preprint arXiv:2410.11289, 2024

  24. [24]

    A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic.SIAM Journal on Optimization, 33(1):147–180, 2023

    Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic.SIAM Journal on Optimization, 33(1):147–180, 2023

  25. [25]

    Parameter-efficient transfer learning for nlp

    Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. InInternational Conference on Machine Learning, pp. 2790–2799. PMLR, 2019

  26. [26]

    LoRA: Low-rank adaptation of large language models

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  27. [27]

    Optimal hessian/jacobian-free nonconvex-pl bilevel optimization.arXiv preprint arXiv:2407.17823, 2024

    Huang, F. Optimal hessian/jacobian-free nonconvex-pl bilevel optimization.arXiv preprint arXiv:2407.17823, 2024

  28. [28]

    Bilevel optimization: Convergence analysis and enhanced design

    Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. InInternational Conference on Machine Learning, pp. 4882–4892. PMLR, 2021

  29. [29]

    A near-optimal algorithm for stochastic bilevel optimization via double-momentum.Advances in Neural Information Processing Systems, 34:30271–30283, 2021

    Khanduri, P., Zeng, S., Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A near-optimal algorithm for stochastic bilevel optimization via double-momentum.Advances in Neural Information Processing Systems, 34:30271–30283, 2021

  30. [30]

    Adam: A method for stochastic optimization

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  31. [31]

    Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure.arXiv preprint arXiv:2509.18993, 2025

    Kong, B., Liang, J., Liu, Y., Deng, R., and Yuan, K. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure.arXiv preprint arXiv:2509.18993, 2025

  32. [32]

    Decentralized bilevel optimization: A perspective from transient iteration complexity.Journal of Machine Learning Research, 26(240):1–64, 2025

    Kong, B., Zhu, S., Lu, S., Huang, X., and Yuan, K. Decentralized bilevel optimization: A perspective from transient iteration complexity.Journal of Machine Learning Research, 26(240):1–64, 2025

  33. [33]

    Learning multiple layers of features from tiny images

    Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  34. [34]

    On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation

    Kwon, J., Kwon, D., Wright, S., and Nowak, R. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation.arXiv preprint arXiv:2309.01753, 2023

  35. [35]

    A fully first-order method for stochastic bilevel optimization

    Kwon, J., Kwon, D., Wright, S., and Nowak, R. D. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pp. 18083–18113. PMLR, 2023

  36. [36]

    Mnist handwritten digit database.ATT Labs [Online]., 2, 2010

    LeCun, Y., Cortes, C., and Burges, C. Mnist handwritten digit database.ATT Labs [Online]., 2, 2010

  37. [37]

    A fully single loop algorithm for bilevel optimization without hessian inverse

    Li, J., Gu, B., and Huang, H. A fully single loop algorithm for bilevel optimization without hessian inverse. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7426–7434, 2022

  38. [38]

    Communication-efficient federated bilevel optimization with local and global lower level problems.arXiv preprint arXiv:2302.06701, 2023

    Li, J., Huang, F., and Huang, H. Communication-efficient federated bilevel optimization with local and global lower level problems.arXiv preprint arXiv:2302.06701, 2023

  39. [39]

    Relora: High-rank training through low-rank updates.arXiv preprint arXiv:2307.05695, 2023

    Lialin, V., Shivagunde, N., Muckatira, S., and Rumshisky, A. Relora: High-rank training through low-rank updates.arXiv preprint arXiv:2307.05695, 2023

  40. [40]

    DoRA: Weight-decomposed low-rank adaptation

    Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., and Chen, M.-H. Dora: Weight-decomposed low-rank adaptation. InForty-first International Conference on Machine Learning, 2024

  41. [41]

    CoLA: Compute-efficient pre-training of LLMs via low-rank activation

    Liu, Z., Zhang, R., Wang, Z., Yan, M., Yang, Z., Hovland, P. D., Nicolae, B., Cappello, F., Tang, S., and Zhang, Z. CoLA: Compute-efficient pre-training of LLMs via low-rank activation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 4627–4645, 2025

  42. [42]

    Penalty method for inversion-free deep bilevel optimization

    Mehra, A. and Hamm, J. Penalty method for inversion-free deep bilevel optimization. InAsian conference on machine learning, pp. 347–362. PMLR, 2021

  43. [43]

    Velora: Memory efficient training using rank-1 sub-token projections

    Miles, R., Reddy, P., Elezi, I., and Deng, J. Velora: Memory efficient training using rank-1 sub-token projections. Advances in Neural Information Processing Systems, 37:42292–42310, 2024

  44. [44]

    Parameter and memory efficient pretraining via low-rank Riemannian optimization

    Mo, Z., Huang, L.-K., and Pan, S. J. Parameter and memory efficient pretraining via low-rank riemannian optimization. In The Thirteenth International Conference on Learning Representations, 2025

  45. [45]

    ScaleBiO: Scalable bilevel optimization for LLM data reweighting

    Pan, R., Zhang, D., Zhang, H., Pan, X., Xu, M., Zhang, J., Pi, R., Wang, X., and Zhang, T. ScaleBiO: Scalable bilevel optimization for LLM data reweighting. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31959–31982, 2025

  46. [46]

    Meta-learning with implicit gradients

    Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients.Advances in Neural Information Processing Systems, 32, 2019

  47. [47]

    Learning to reweight examples for robust deep learning

    Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. PMLR, 2018

  48. [48]

    Truncated back-propagation for bilevel optimization

    Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. InThe 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723–1732. PMLR, 2019

  49. [49]

    GLU Variants Improve Transformer

    Shazeer, N. GLU variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  50. [50]

    SEAL: Safety-enhanced aligned LLM fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

    Shen, H., Chen, P.-Y., Das, P., and Chen, T. SEAL: Safety-enhanced aligned LLM fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

  51. [51]

    Bilevel ZOFO: Efficient LLM fine-tuning and meta-training.arXiv preprint arXiv:2502.03604, 2025

    Shirkavand, R., Yu, P., He, Q., and Huang, H. Bilevel ZOFO: Efficient LLM fine-tuning and meta-training.arXiv preprint arXiv:2502.03604, 2025

  52. [52]

    On the importance of initialization and momentum in deep learning

    Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. InInternational conference on machine learning, pp. 1139–1147. PMLR, 2013

  53. [53]

    FedNest: Federated bilevel, minimax, and compositional optimization

    Tarzanagh, D. A., Li, M., Thrampoulidis, C., and Oymak, S. Fednest: Federated bilevel, minimax, and compositional optimization. InInternational Conference on Machine Learning, pp. 21146–21179. PMLR, 2022

  54. [54]

    Learning from noisy labels with decoupled meta label purifier

    Tu, Y., Zhang, B., Li, Y., Liu, L., Li, J., Wang, Y., Wang, C., and Zhao, C. R. Learning from noisy labels with decoupled meta label purifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19934–19943, 2023

  55. [55]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

  56. [56]

    DRAW: Domain weight randomization with bayesian updating for LLM pre-training.Transactions on Machine Learning Research, 2026

    Wang, R.-N., Qiao, Y.-Q., Xie, Z.-L., and Kun, Y. DRAW: Domain weight randomization with bayesian updating for LLM pre-training.Transactions on Machine Learning Research, 2026. ISSN 2835-8856

  57. [57]

    Coap: Memory-efficient training with correlation-aware gradient projection

    Xiao, J., Sang, S., Zhi, T., Liu, J., Yan, Q., Luo, L., and Yuan, B. Coap: Memory-efficient training with correlation-aware gradient projection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 30116–30126, 2025

  58. [58]

    DoReMi: Optimizing data mixtures speeds up language model pretraining

    Xie, S. M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P. S., Le, Q. V., Ma, T., and Yu, A. W. DoReMi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36:69798–69818, 2023

  59. [59]

    Provably faster algorithms for bilevel optimization.Advances in Neural Information Processing Systems, 34:13670–13682, 2021

    Yang, J., Ji, K., and Liang, Y. Provably faster algorithms for bilevel optimization.Advances in Neural Information Processing Systems, 34:13670–13682, 2021

  60. [60]

    Decentralized gossip-based stochastic bilevel optimization over communication networks.Advances in Neural Information Processing Systems, 35:238–252, 2022

    Yang, S., Zhang, X., and Wang, M. Decentralized gossip-based stochastic bilevel optimization over communication networks.Advances in Neural Information Processing Systems, 35:238–252, 2022

  61. [61]

    Achieving O(ε^{-1.5}) complexity in Hessian/Jacobian-free stochastic bilevel optimization

    Yang, Y., Xiao, P., and Ji, K. Achieving O(ε^{-1.5}) complexity in Hessian/Jacobian-free stochastic bilevel optimization. Advances in Neural Information Processing Systems, 36:39491–39503, 2023

  62. [62]

    Simfbo: Towards simple, flexible and communication-efficient federated bilevel learning.Advances in Neural Information Processing Systems, 36:33027–33040, 2023

    Yang, Y., Xiao, P., and Ji, K. Simfbo: Towards simple, flexible and communication-efficient federated bilevel learning.Advances in Neural Information Processing Systems, 36:33027–33040, 2023

  63. [63]

    Tuning-free bilevel optimization: New algorithms and convergence analysis.arXiv preprint arXiv:2410.05140, 2024

    Yang, Y., Ban, H., Huang, M., Ma, S., and Ji, K. Tuning-free bilevel optimization: New algorithms and convergence analysis.arXiv preprint arXiv:2410.05140, 2024

  64. [64]

    Lancbio: dynamic lanczos-aided bilevel optimization via krylov subspace

    Yang, Y., Gao, B., and Yuan, Y.-x. Lancbio: dynamic lanczos-aided bilevel optimization via krylov subspace. arXiv preprint arXiv:2404.03331, 2024

  65. [65]

    First-order federated bilevel learning

    Yang, Y., Xiao, P., Ma, S., and Ji, K. First-order federated bilevel learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 22029–22037, 2025

  66. [66]

    Breaking the frozen subspace: Importance sampling for low-rank optimization in LLM pretraining

    Zhang, H., Yin, J., Wang, G., Liu, Z., Yang, L. F., Zhang, T., Shrivastava, A., and Braverman, V. Breaking the frozen subspace: Importance sampling for low-rank optimization in LLM pretraining.arXiv preprint arXiv:2502.05790, 2025

  67. [67]

    GaLore: Memory-efficient LLM training by gradient low-rank projection

    Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., and Tian, Y. GaLore: Memory-efficient LLM training by gradient low-rank projection.arXiv preprint arXiv:2403.03507, 2024

  68. [68]

    Meta label correction for noisy label learning

    Zheng, G., Awadallah, A. H., and Dumais, S. Meta label correction for noisy label learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11053–11061, 2021

  69. [69]

    Apollo: SGD-like memory, AdamW-level performance

    Zhu, H., Zhang, Z., Cong, W., Liu, X., Park, S., Chandra, V., Long, B., Pan, D. Z., Wang, Z., and Lee, J. Apollo: SGD-like memory, adamw-level performance.Proceedings of Machine Learning and Systems, 7, 2025

  70. [70]

    SPARKLE: a unified single-loop primal-dual framework for decentralized bilevel optimization.Advances in Neural Information Processing Systems, 37:62912–62987, 2024

    Zhu, S., Kong, B., Lu, S., Huang, X., and Yuan, K. SPARKLE: a unified single-loop primal-dual framework for decentralized bilevel optimization.Advances in Neural Information Processing Systems, 37:62912–62987, 2024

  71. [71]

    Moreover, define L_max := max{L_{f,0}, L_{f,1}, L_{g,1}, L_{g,2}} and κ := L_max/µ_g

    … for every fixed x, the map Y ↦ g(x, Y) is µ_g-strongly convex in the product Frobenius norm. Moreover, define L_max := max{L_{f,0}, L_{f,1}, L_{g,1}, L_{g,2}} and κ := L_max/µ_g. Assumption 5 (Stochastic projected oracles and filtration). For any k ≥ 0, let F_k = σ{x_0, …, x_k, Y_0, …, Y_k, Z_0, …, Z_k, h_0, …, h_k, P_0, …, P_{k−1}} and F_k^+ = σ(F_k ∪ {P_k}). Define the projected fun...

  72. [72]

    Evaluating the invariant-tensor form at (i, a, b, j) = (1, 2, 1, 2), (1, 2, 2, 1), and (1, 1, 1, 1) shows that c_2 = c_3 = r(m−r)/(m(m−1)(m+2)) and c_1 = r(r(m+1)−2)/(m(m−1)(m+2))

    … = tr(Q_0) = r together with rotational invariance yields E[(Q_0)²_{12}] = r(m−r)/(m(m−1)(m+2)). Evaluating the invariant-tensor form at (i, a, b, j) = (1, 2, 1, 2), (1, 2, 2, 1), and (1, 1, 1, 1) shows that c_2 = c_3 = r(m−r)/(m(m−1)(m+2)) and c_1 = r(r(m+1)−2)/(m(m−1)(m+2)). Hence for any A ∈ R^{m×m}, E[Q_0 A Q_0] = c_1 A + c_2 A^⊤ + c_2 tr(A) I_m. Multiplying by (m/r)² and simplifying give...

  73. [73]

    Consequently, for any ε < (19/39)², the naive projected auxiliary update cannot guarantee E∥∇Φ(x)∥² ≤ ε even in this noiseless quadratic problem

    However, this point is not stationary for the true bilevel objective, since ∇Φ(x̄) = x̄ + 1 = 19/39 ≠ 0. Consequently, for any ε < (19/39)², the naive projected auxiliary update cannot guarantee E∥∇Φ(x)∥² ≤ ε even in this noiseless quadratic problem. This example shows that correcting the projected auxiliary HVP is necessary for targeting the original bilevel...