Distribution-Aware Robust Bilevel Optimization: Quantile-Guided Huber Updates in Two-Timescale Stochastic Approximation

Davide Carbone; Xi Xuan; Zhiyu Li

arxiv: 2606.22436 · v1 · pith:L5ZUGRRFnew · submitted 2026-06-21 · 💻 cs.LG

Distribution-Aware Robust Bilevel Optimization: Quantile-Guided Huber Updates in Two-Timescale Stochastic Approximation

Zhiyu Li , Xi Xuan , Davide Carbone This is my paper

Pith reviewed 2026-06-26 10:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords bilevel optimizationtwo-timescale stochastic approximationheavy-tailed noisequantile estimationHuber clippingrobust optimizationconvergence analysis

0 comments

The pith

Quantile-guided Huber updates in two-timescale stochastic approximation achieve convergence rate O(T^{-(p-1)/(3p-2)}) for bilevel optimization under heavy-tailed noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that using rolling quantiles from historical gradient buffers to adaptively set Huber clipping thresholds inside two-timescale stochastic approximation makes bilevel optimization stable when stochastic gradients have infinite variance. This matters because myopic clipping methods often fail to separate useful signals from outliers, leading to divergence in hierarchical tasks. The framework supplies both a theoretical rate that matches the optimal dependence on the tail index p and empirical evidence of stable runs on vision, game, and reinforcement learning problems with low added cost. If the claim holds, bilevel methods become usable in settings where noise distributions were previously prohibitive.

Core claim

Under nonconvex-strongly convex assumptions with noise having finite p-moments for p in (1,2], quantile-guided TTSA converges at rate O(T^{-(p-1)/(3p-2)}), recovering the optimal dependence on the heavy-tailed parameter. The method estimates rolling quantiles from gradient buffers to drive adaptive Huber-style clipping that preserves local geometry while bounding effective variance. Experiments across six tasks confirm elimination of divergence spikes relative to baselines.

What carries the argument

Rolling quantile estimation from historical gradient buffers, used to set adaptive thresholds for Huber-style clipping inside the two-timescale stochastic approximation updates.

If this is right

Convergence holds with the stated rate for the full range of infinite-variance noise p in (1,2].
The approach applies directly to nonconvex-strongly convex bilevel problems.
Empirical stability extends to vision benchmarks, momentum-poisoned games, and offline reinforcement learning.
Hyperparameter sensitivity drops while added compute stays near 2.7 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quantile-guided clipping could be tested in single-level stochastic gradient settings to check whether the rate improvement carries over.
Integration with momentum or other acceleration schemes might further tighten the dependence on p.
The buffer-based quantile mechanism could be examined under non-stationary noise where the tail index itself drifts.

Load-bearing premise

Historical gradient buffers must produce reliable rolling quantile estimates that preserve local optimization geometry and bound variance without the quantile step introducing bias or lag that invalidates the two-timescale analysis.

What would settle it

Run the method on a bilevel problem with p=1.5 noise and check whether the observed rate falls short of O(T^{-(p-1)/(3p-2)}) or whether divergence still occurs on one of the reported tasks.

Figures

Figures reproduced from arXiv: 2606.22436 by Davide Carbone, Xi Xuan, Zhiyu Li.

**Figure 2.** Figure 2: Convergence on Fashion-MNIST (Momentum-Integrated). RQ-TTSA (solid blue) achieves [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Empirical verification of convergence rates under heavy-tailed noise (p = 1.5). The log-log plot visualizes the decay of the upper-level gradient norm ∥∇Φ(x)∥ over iterations T. (1) Precision: RQ-TTSA (solid blue) achieves an empirical slope of ≈ −0.197, aligning almost perfectly with the theoretical optimal rate of −0.200 (red dashed). (2) Robustness: The method exhibits identical convergence traits under… view at source ↗

**Figure 4.** Figure 4: Ablation on clipping mechanisms (Fashion-MNIST). Comparison of Upper-Level Gradient Norm ∥∇Φ(x)∥ trajectories. The Norm-Based approach (Ours, blue) significantly outperforms Coordinate-Wise clipping (red) and Momentum baseline (orange), confirming that preserving gradient direction is crucial for bilevel convergence under heavy-tailed noise. D.4 Dynamics Verification in the Impulsive Corridor To provide a… view at source ↗

**Figure 5.** Figure 5: Dynamics Verification in the Impulsive Corridor. The background streamlines visualize the underlying noise-free gradient field directing toward the Target Attractor. Standard TTSA (Amber) fails to bound the update variance; a single heavy-tailed impulse provides enough momentum to eject the trajectory from the stable region, leading to immediate divergence. In contrast, RQTTSA (Teal) leverages its distrib… view at source ↗

**Figure 6.** Figure 6: Dynamics Verification in the Impulsive Corridor. Top: The external heavy-tailed force Fext(t), featuring catastrophic shocks and a sustained burst of high-energy noise. Bottom: The system convergence trajectory in parameter space. (1) Standard TTSA (Orange) lacks robust control and diverges instantly upon the first catastrophic shock. (2) Fixed Clipping (Red) prevents divergence but suffers from significan… view at source ↗

**Figure 7.** Figure 7: Canalization under Impulsive Noise in Bio-GRNs. The streamlines visualize the regulatory flow toward the equilibrium α ∗ . Under transcriptional bursts, standard methods (TTSA, AccBO) are ejected from the developmental path, and BiSLS stagnates. In contrast, RQ-TTSA (Cyan) leverages quantile-guided updates to filter impulsive variance, maintaining stable canalization along the non-linear valley to reach th… view at source ↗

**Figure 8.** Figure 8: RQ-TTSA Sensitivity Analysis on Fashion-MNIST. The heatmap illustrates the terminal validation loss across various quantile thresholds τ and window sizes W. The optimal configuration is identified at W = 100, τ = 0.95, though the overall landscape demonstrates significant robustness to parameter selection. I Code Availability and Reproducibility To support the reproducibility of our empirical results, we p… view at source ↗

read the original abstract

Bilevel optimization (BLO) is fundamental to hierarchical decision-making but suffers from critical instability under heavy-tailed stochastic noise. Existing variance-reduction techniques typically rely on myopic magnitude checks, which fail to distinguish informative geometric signals from impulsive outliers. To resolve this, we propose \textbf{RQ-TTSA} (Robust Quantile-guided TTSA), a distribution-aware framework that leverages historical gradient buffers to estimate rolling quantiles for adaptive Huber-style clipping, effectively preserving local optimization geometry while strictly bounding effective variance. Theoretically, we provide a convergence analysis for quantile-guided TTSA under nonconvex-strongly convex assumptions with infinite-variance noise ($p \in (1,2]$), deriving a rate of $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ that recovers optimal dependence on the heavy-tailed parameter. Empirically, across six diverse tasks, spanning heterogeneous vision benchmarks, dynamic games under momentum poisoning, and offline reinforcement learning, RQ-TTSA consistently outperforms state-of-the-art baselines by eliminating divergence spikes and ensuring stable convergence. Our method demonstrates significant robustness to hyperparameter variations and incurs negligible computational overhead ($\approx 2.7\%$ increase), validating distribution-aware gradient control as a practical and necessary component for reliable bilevel learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RQ-TTSA adds rolling-quantile Huber clipping to TTSA for bilevel problems under heavy-tailed noise and claims the matching rate O(T^{-(p-1)/(3p-2)}), but the analysis leaves the quantile estimator's own error term under-specified.

read the letter

The paper's core contribution is RQ-TTSA, which uses historical gradient buffers to compute rolling quantiles that adapt the clipping threshold in a Huber-style update for two-timescale stochastic approximation applied to bilevel optimization. This targets instability from heavy-tailed noise where variance is infinite.

It does a solid job laying out a framework that is distribution-aware rather than just magnitude-based. The claimed convergence rate of O(T^{-(p-1)/(3p-2)}) under nonconvex-strongly convex assumptions for p in (1,2] recovers the right dependence on the tail index, which is a nice touch. The experiments across vision, games, and RL show consistent outperformance and low overhead, which suggests the method is practical.

The soft spot is in how the rolling quantile is handled in the theory. The analysis assumes that the quantile estimates from finite buffers do not introduce bias or lag that perturbs the noise moments or the timescale separation. But in bilevel problems the outer loop gradients are non-stationary, so a buffer of fixed length could produce estimates whose error does not vanish fast enough relative to the claimed exponent. If that error feeds into the drift term, the rate could degrade. The abstract does not spell out how this is bounded, so the central claim rests on that step holding.

The paper is for people working on robust hierarchical optimization in machine learning, especially where data has heavy tails. A reader interested in stochastic approximation methods would get value from the specific rate and the empirical validation.

It deserves a serious referee because the problem is relevant and the method is clearly motivated, even if the proof details on the quantile part need scrutiny. I would recommend sending it to review rather than desk rejecting.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes RQ-TTSA, a distribution-aware robust bilevel optimization framework that uses rolling quantile estimates from historical gradient buffers to set adaptive Huber clipping thresholds within two-timescale stochastic approximation. Under nonconvex-strongly convex assumptions and p-stable noise (p ∈ (1,2]), it derives a convergence rate of O(T^{-(p-1)/(3p-2)}) and reports consistent empirical outperformance over baselines on six tasks spanning vision, dynamic games, and offline RL, with negligible overhead.

Significance. If the rate derivation rigorously absorbs the approximation error from the quantile estimator while preserving the claimed dependence on p, the result would be significant: it supplies the first distribution-aware variance control for bilevel problems that recovers the optimal heavy-tail exponent without requiring finite variance. The empirical robustness across heterogeneous domains would further support its utility for practical hierarchical learning under impulsive noise.

major comments (1)

[Abstract / convergence claim paragraph] Abstract and theoretical analysis paragraph: the claimed rate O(T^{-(p-1)/(3p-2)}) is derived by treating the quantile-guided Huber threshold as strictly variance-bounding while preserving local geometry, yet the analysis does not supply an explicit bound on the bias or lag error of the rolling quantile computed from finite historical buffers. In the bilevel setting the outer gradients are non-stationary, so this error term is not shown to be o(T^{-(p-1)/(3p-2)}) and may couple into the Lyapunov drift, degrading the exponent.

minor comments (1)

[Abstract] The abstract states empirical superiority but provides no details on the exact quantile estimator (e.g., sample quantile formula, buffer length, or update frequency), error-bar reporting, or baseline hyperparameter tuning protocol.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive comment on the convergence analysis. The observation regarding the quantile estimator's approximation error is well-taken, and we address it directly below while committing to a revision that strengthens the theoretical claims.

read point-by-point responses

Referee: [Abstract / convergence claim paragraph] Abstract and theoretical analysis paragraph: the claimed rate O(T^{-(p-1)/(3p-2)}) is derived by treating the quantile-guided Huber threshold as strictly variance-bounding while preserving local geometry, yet the analysis does not supply an explicit bound on the bias or lag error of the rolling quantile computed from finite historical buffers. In the bilevel setting the outer gradients are non-stationary, so this error term is not shown to be o(T^{-(p-1)/(3p-2)}) and may couple into the Lyapunov drift, degrading the exponent.

Authors: We acknowledge that the current manuscript does not provide an explicit bound on the bias and lag error induced by the finite-buffer rolling quantile estimator, nor does it demonstrate that this error remains o(T^{-(p-1)/(3p-2)}) under the non-stationary outer gradients characteristic of the bilevel setting. The analysis instead proceeds by treating the quantile threshold as effectively variance-bounding once the buffer is sufficiently large. To close this gap, we will add a supporting lemma in the revised theoretical section that quantifies the quantile approximation error under p-stable noise and the two-timescale dynamics. The lemma will establish that the resulting perturbation to the Lyapunov drift is absorbed into lower-order terms, preserving the leading exponent. We will also update the abstract to reflect this strengthened guarantee. revision: yes

Circularity Check

0 steps flagged

No circularity detected in convergence rate derivation

full rationale

The paper states a convergence analysis deriving the rate O(T^{-(p-1)/(3p-2)}) directly from nonconvex-strongly-convex assumptions and p-stable noise properties for the quantile-guided TTSA algorithm. No equations, steps, or claims in the provided text reduce this rate to a fitted parameter, self-citation chain, or input quantity by construction. The rolling quantile mechanism is introduced as part of the proposed method and then analyzed under explicit assumptions; the derivation is presented as independent rather than tautological. This matches the common case of a self-contained theoretical result with external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the nonconvex-strongly convex bilevel setting and the existence of useful rolling quantiles from gradient history; no explicit free parameters, invented physical entities, or additional axioms are stated.

axioms (1)

domain assumption Bilevel problem satisfies nonconvex-strongly convex structure
Invoked in the convergence analysis paragraph of the abstract.

pith-pipeline@v0.9.1-grok · 5763 in / 1296 out tokens · 25023 ms · 2026-06-26T10:50:36.233208+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 7 canonical work pages · 2 internal anchors

[1]

2022 , eprint=

A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic , author=. 2022 , eprint=

2022
[2]

International conference on machine learning , pages=

Bilevel optimization: Convergence analysis and enhanced design , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[3]

arXiv preprint arXiv:2206.06995 , year=

Two-timescale stochastic approximation for bilevel optimisation problems in continuous-time models , author=. arXiv preprint arXiv:2206.06995 , year=

work page arXiv
[4]

Advances in Neural Information Processing Systems , volume=

Bisls/sps: Auto-tune step sizes for stable bi-level optimization , author=. Advances in Neural Information Processing Systems , volume=
[5]

arXiv preprint arXiv:1905.11881 , year=

Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. arXiv preprint arXiv:1905.11881 , year=

work page arXiv 1905
[6]

Advances in neural information processing systems , volume=

Momentum-based variance reduction in non-convex sgd , author=. Advances in neural information processing systems , volume=
[7]

International conference on machine learning , pages=

Model-agnostic meta-learning for fast adaptation of deep networks , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[8]

International conference on machine learning , pages=

Bilevel programming for hyperparameter optimization and meta-learning , author=. International conference on machine learning , pages=. 2018 , organization=

2018
[9]

International conference on machine learning , pages=

Decentralized stochastic bilevel optimization with improved per-iteration complexity , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[10]

The Thirty Seventh Annual Conference on Learning Theory , pages=

Fast two-time-scale stochastic gradient method with applications in reinforcement learning , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

2024
[11]

Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

Robust estimation via robust gradient estimation , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2020 , publisher=

2020
[12]

Proceedings of Machine learning and systems , volume=

Federated optimization in heterogeneous networks , author=. Proceedings of Machine learning and systems , volume=
[13]

International Conference on Machine Learning , pages=

A tail-index analysis of stochastic gradient noise in deep neural networks , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[14]

IEEE transactions on pattern analysis and machine intelligence , volume=

Meta-learning in neural networks: A survey , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2021 , publisher=

2021
[15]

IEEE Signal Processing Magazine , volume=

An introduction to bilevel optimization: Foundations and applications in signal processing and machine learning , author=. IEEE Signal Processing Magazine , volume=. 2024 , publisher=

2024
[16]

arXiv preprint arXiv:2102.07367 , year=

A momentum-assisted single-timescale stochastic approximation algorithm for bilevel optimization , author=. arXiv preprint arXiv:2102.07367 , year=

work page arXiv
[17]

Advances in Neural Information Processing Systems , volume=

Optimal learners for realizable regression: Pac learning and online learning , author=. Advances in Neural Information Processing Systems , volume=
[18]

arXiv preprint arXiv:2506.08164 , year=

BLUR: A Bi-Level Optimization Approach for LLM Unlearning , author=. arXiv preprint arXiv:2506.08164 , year=

work page arXiv
[19]

International Conference on Machine Learning , pages=

Grad-match: Gradient matching based data subset selection for efficient deep model training , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[20]

Advances in Neural Information Processing Systems , volume=

Stochastic optimization with heavy-tailed noise via accelerated gradient clipping , author=. Advances in Neural Information Processing Systems , volume=
[21]

Advances in neural information processing systems , volume=

Tackling the objective inconsistency problem in heterogeneous federated optimization , author=. Advances in neural information processing systems , volume=
[22]

International Conference on Machine Learning , pages=

Provably efficient algorithms for multi-objective competitive rl , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[23]

International Conference on Artificial Intelligence and Statistics , pages=

Local SGD: Unified theory and new efficient methods , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

2021
[24]

Advances in Neural Information Processing Systems , volume=

Large language model unlearning , author=. Advances in Neural Information Processing Systems , volume=
[25]

Adaptive Federated Optimization

Adaptive federated optimization , author=. arXiv preprint arXiv:2003.00295 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2003
[26]

Foundations of Computational Mathematics , volume=

Mean estimation and regression under heavy-tailed distributions: A survey , author=. Foundations of Computational Mathematics , volume=. 2019 , publisher=

2019
[27]

arXiv preprint arXiv:2401.09587 , year=

Bilevel optimization under unbounded smoothness: A new algorithm and convergence analysis , author=. arXiv preprint arXiv:2401.09587 , year=

work page arXiv
[28]

Journal of Machine Learning Research , volume=

Optimal algorithms for stochastic bilevel optimization under relaxed smoothness conditions , author=. Journal of Machine Learning Research , volume=
[29]

Advances in Neural Information Processing Systems , volume=

An accelerated algorithm for stochastic bilevel optimization under unbounded smoothness , author=. Advances in Neural Information Processing Systems , volume=
[30]

Advances in Neural Information Processing Systems , volume=

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms , author=. Advances in Neural Information Processing Systems , volume=
[31]

International Conference on Machine Learning , pages=

Blockwise stochastic variance-reduced methods with parallel speedup for multi-block bilevel optimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[32]

Journal of Optimization Theory and Applications , volume=

High-Probability Complexity Bounds for Non-smooth Stochastic Convex Optimization with Heavy-Tailed Noise , author=. Journal of Optimization Theory and Applications , volume=. 2024 , publisher=

2024
[33]

Approximation Methods for Bilevel Programming

Approximation methods for bilevel programming , author=. arXiv preprint arXiv:1802.02246 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Linear convergence of gradient and proximal-gradient methods under the polyak-

Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

2016
[35]

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

Anonymous Authors. Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization
[36]

2014 , publisher=

The strategy of the genes , author=. 2014 , publisher=

2014
[37]

Cell , volume=

Nature, nurture, or chance: stochastic gene expression and its consequences , author=. Cell , volume=. 2008 , publisher=

2008
[38]

Optimizing methods in statistics , pages=

A convergence theorem for non negative almost supermartingales and some applications , author=. Optimizing methods in statistics , pages=. 1971 , publisher=

1971

[1] [1]

2022 , eprint=

A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic , author=. 2022 , eprint=

2022

[2] [2]

International conference on machine learning , pages=

Bilevel optimization: Convergence analysis and enhanced design , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[3] [3]

arXiv preprint arXiv:2206.06995 , year=

Two-timescale stochastic approximation for bilevel optimisation problems in continuous-time models , author=. arXiv preprint arXiv:2206.06995 , year=

work page arXiv

[4] [4]

Advances in Neural Information Processing Systems , volume=

Bisls/sps: Auto-tune step sizes for stable bi-level optimization , author=. Advances in Neural Information Processing Systems , volume=

[5] [5]

arXiv preprint arXiv:1905.11881 , year=

Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. arXiv preprint arXiv:1905.11881 , year=

work page arXiv 1905

[6] [6]

Advances in neural information processing systems , volume=

Momentum-based variance reduction in non-convex sgd , author=. Advances in neural information processing systems , volume=

[7] [7]

International conference on machine learning , pages=

Model-agnostic meta-learning for fast adaptation of deep networks , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[8] [8]

International conference on machine learning , pages=

Bilevel programming for hyperparameter optimization and meta-learning , author=. International conference on machine learning , pages=. 2018 , organization=

2018

[9] [9]

International conference on machine learning , pages=

Decentralized stochastic bilevel optimization with improved per-iteration complexity , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[10] [10]

The Thirty Seventh Annual Conference on Learning Theory , pages=

Fast two-time-scale stochastic gradient method with applications in reinforcement learning , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

2024

[11] [11]

Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

Robust estimation via robust gradient estimation , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2020 , publisher=

2020

[12] [12]

Proceedings of Machine learning and systems , volume=

Federated optimization in heterogeneous networks , author=. Proceedings of Machine learning and systems , volume=

[13] [13]

International Conference on Machine Learning , pages=

A tail-index analysis of stochastic gradient noise in deep neural networks , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019

[14] [14]

IEEE transactions on pattern analysis and machine intelligence , volume=

Meta-learning in neural networks: A survey , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2021 , publisher=

2021

[15] [15]

IEEE Signal Processing Magazine , volume=

An introduction to bilevel optimization: Foundations and applications in signal processing and machine learning , author=. IEEE Signal Processing Magazine , volume=. 2024 , publisher=

2024

[16] [16]

arXiv preprint arXiv:2102.07367 , year=

A momentum-assisted single-timescale stochastic approximation algorithm for bilevel optimization , author=. arXiv preprint arXiv:2102.07367 , year=

work page arXiv

[17] [17]

Advances in Neural Information Processing Systems , volume=

Optimal learners for realizable regression: Pac learning and online learning , author=. Advances in Neural Information Processing Systems , volume=

[18] [18]

arXiv preprint arXiv:2506.08164 , year=

BLUR: A Bi-Level Optimization Approach for LLM Unlearning , author=. arXiv preprint arXiv:2506.08164 , year=

work page arXiv

[19] [19]

International Conference on Machine Learning , pages=

Grad-match: Gradient matching based data subset selection for efficient deep model training , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[20] [20]

Advances in Neural Information Processing Systems , volume=

Stochastic optimization with heavy-tailed noise via accelerated gradient clipping , author=. Advances in Neural Information Processing Systems , volume=

[21] [21]

Advances in neural information processing systems , volume=

Tackling the objective inconsistency problem in heterogeneous federated optimization , author=. Advances in neural information processing systems , volume=

[22] [22]

International Conference on Machine Learning , pages=

Provably efficient algorithms for multi-objective competitive rl , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[23] [23]

International Conference on Artificial Intelligence and Statistics , pages=

Local SGD: Unified theory and new efficient methods , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

2021

[24] [24]

Advances in Neural Information Processing Systems , volume=

Large language model unlearning , author=. Advances in Neural Information Processing Systems , volume=

[25] [25]

Adaptive Federated Optimization

Adaptive federated optimization , author=. arXiv preprint arXiv:2003.00295 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2003

[26] [26]

Foundations of Computational Mathematics , volume=

Mean estimation and regression under heavy-tailed distributions: A survey , author=. Foundations of Computational Mathematics , volume=. 2019 , publisher=

2019

[27] [27]

arXiv preprint arXiv:2401.09587 , year=

Bilevel optimization under unbounded smoothness: A new algorithm and convergence analysis , author=. arXiv preprint arXiv:2401.09587 , year=

work page arXiv

[28] [28]

Journal of Machine Learning Research , volume=

Optimal algorithms for stochastic bilevel optimization under relaxed smoothness conditions , author=. Journal of Machine Learning Research , volume=

[29] [29]

Advances in Neural Information Processing Systems , volume=

An accelerated algorithm for stochastic bilevel optimization under unbounded smoothness , author=. Advances in Neural Information Processing Systems , volume=

[30] [30]

Advances in Neural Information Processing Systems , volume=

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms , author=. Advances in Neural Information Processing Systems , volume=

[31] [31]

International Conference on Machine Learning , pages=

Blockwise stochastic variance-reduced methods with parallel speedup for multi-block bilevel optimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[32] [32]

Journal of Optimization Theory and Applications , volume=

High-Probability Complexity Bounds for Non-smooth Stochastic Convex Optimization with Heavy-Tailed Noise , author=. Journal of Optimization Theory and Applications , volume=. 2024 , publisher=

2024

[33] [33]

Approximation Methods for Bilevel Programming

Approximation methods for bilevel programming , author=. arXiv preprint arXiv:1802.02246 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Linear convergence of gradient and proximal-gradient methods under the polyak-

Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

2016

[35] [35]

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

Anonymous Authors. Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

[36] [36]

2014 , publisher=

The strategy of the genes , author=. 2014 , publisher=

2014

[37] [37]

Cell , volume=

Nature, nurture, or chance: stochastic gene expression and its consequences , author=. Cell , volume=. 2008 , publisher=

2008

[38] [38]

Optimizing methods in statistics , pages=

A convergence theorem for non negative almost supermartingales and some applications , author=. Optimizing methods in statistics , pages=. 1971 , publisher=

1971