arxiv: 2605.05660 · v1 · submitted 2026-05-07 · 💻 cs.LG · math.OC

Recognition: unknown

Distributionally Robust Multi-Objective Optimization

Yufeng Yang , Fangning Zhuo , Ziyi Chen , Heng Huang , Yi Zhou

Authors on Pith no claims yet

Pith reviewed 2026-05-08 14:56 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords distributionally robust optimizationmulti-objective optimizationmulti-gradient descentPareto stationaritysample complexitygradient clippingnonconvex optimization

0 comments

The pith

Distributionally robust multi-objective optimization reaches an ε-Pareto-stationary point with total sample complexity O(ε^{-4}) using a single-loop double-clip MGDA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces distributionally robust multi-objective optimization (DR-MOO) to minimize several objectives simultaneously while guarding each one against its own worst-case data distribution. It develops multi-gradient descent algorithms that locate Pareto-type stationary points even when the underlying problems are nonconvex. An initial double-loop version estimates dual variables separately and reaches the target with O(ε^{-12}) samples; a refined single-loop version applies gradient clipping to manage biased and generalized-smooth estimates, eliminating the inner loop and improving the bound to O(ε^{-4}). The guarantees hold without assuming bounded objectives or gradients. This matters for learning tasks that must satisfy multiple criteria under possible data shifts, because earlier multi-objective methods either ignored distributional robustness or paid a much higher sampling cost.

Core claim

By reformulating DR-MOO through a Lagrangian dual and applying gradient clipping to stabilize estimation of worst-case gradients, the single-loop double-clip MGDA locates an ε-Pareto-stationary point with total sample complexity O(ε^{-4}) in the nonconvex setting; this removes the need for double sampling while still guaranteeing convergence without boundedness assumptions on the objectives or gradients.

What carries the argument

The double-clip multi-gradient descent algorithm (MGDA) that clips both primal and dual gradient estimates to control bias and generalized smoothness when solving the distributionally robust Lagrangian.

If this is right

DR-MOO yields solutions that remain effective under distributional shifts for each objective separately.
The single-loop method applies directly to nonconvex multi-objective problems.
Gradient clipping removes the requirement for an inner sampling loop while preserving convergence guarantees.
No boundedness assumptions on objectives or gradients are needed for the stated sample-complexity results.
The algorithms remain competitive with existing MGDA baselines on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The clipping technique might transfer to other single-loop robust optimization algorithms that currently rely on double sampling.
DR-MOO could be combined with fairness or safety constraints that are themselves expressed as multiple objectives.
The improved complexity bound suggests it may be practical to solve larger-scale distributionally robust problems that were previously limited by sample cost.
The Lagrangian dual view might allow extension to constraints that themselves must be distributionally robust.

Load-bearing premise

Worst-case distributions exist for every objective and clipped gradient estimates can be formed reliably without extra bounds on the objective functions.

What would settle it

A concrete nonconvex instance in which the single-loop algorithm fails to produce an ε-Pareto-stationary point after O(ε^{-4}) samples, or diverges when gradients are unbounded, would falsify the claimed complexity.

Figures

Figures reproduced from arXiv: 2605.05660 by Fangning Zhuo, Heng Huang, Yi Zhou, Yufeng Yang, Ziyi Chen.

**Figure 1.** Figure 1: Effects of data distribution shift on function geometry (Left) and Pareto-frontier (Right). view at source ↗

**Figure 2.** Figure 2: Balanced Gradient vs.Time Steps and Sample Consumption for Linear (Left) and Logistic view at source ↗

**Figure 3.** Figure 3: Ablation on effective learning rates for Linear(Left) and Logistic Regressions(Right) view at source ↗

**Figure 4.** Figure 4: Ablation on Effects of Batch Size (Left) and Inner-Step (Right) view at source ↗

**Figure 5.** Figure 5: Iteration-wise Wall-clock Time Comparison for Each Method. view at source ↗

read the original abstract

Multi-objective optimization (MOO) has received growing attention in applications that require learning under multiple criteria. However, the existing MOO formulations do not explicitly account for distributional shifts in the data. We introduce distributionally robust multi-objective optimization (DR-MOO), which minimizes multiple objectives under their respective worst-case distributions. We propose Pareto-type solution concepts for DR-MOO and develop multi-gradient descent algorithms (MGDA) with provable guarantees. Leveraging a Lagrangian dual reformulation, we first design a double-loop MGDA that uses an inner loop to estimate dual variables and achieves a total sample complexity $\mathcal{O}(\epsilon^{-12})$ for reaching an $\epsilon$-Pareto-stationary point. To further improve efficiency, we incorporate gradient clipping to handle generalized-smooth and biased gradient estimates, removing the need for double sampling. This yields a single-loop double-clip MGDA with substantially improved sample complexity $\mathcal{O}(\epsilon^{-4})$. Our theory applies to the nonconvex setting and does not require bounded objectives or gradients. Experiments demonstrate that our methods are competitive with state-of-the-art MGDA baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds distributional robustness to multi-objective optimization and claims a single-loop clipped MGDA with O(ε^{-4}) sample complexity in nonconvex settings without boundedness assumptions, but the dual bias control under clipping needs close checking.

read the letter

The core new piece is the DR-MOO formulation that minimizes each objective under its own worst-case distribution, plus Pareto-stationary notions for it. They then build MGDA variants, starting with a double-loop version at O(ε^{-12}) total samples and moving to a single-loop double-clip version at O(ε^{-4}). The clipping step is presented as handling generalized smoothness and the bias from single-sampling the Lagrangian duals, which removes the inner loop and drops the complexity bound while still working for nonconvex problems without assuming bounded objectives or gradients. That complexity jump is the clearest technical advance over prior MGDA work, and the experiments show the methods stay competitive with standard baselines on the usual test problems. If the proofs close, this is a useful efficiency gain for anyone already using MGDA-style methods in robust settings. The main soft spot sits in the nonconvex analysis. Clipping the per-objective gradients is claimed to control both smoothness and the bias from estimating the dual variables in one pass, but without explicit bounds on the duals themselves the error term in the common descent direction is not obviously controlled at the stated rate. The abstract and stress-test note both flag this as the least secure link, and the paper's own guarantee of no boundedness assumptions makes the bias argument non-obvious. The experiments are only competitiveness checks, not targeted ablations on the clipping or dual estimation steps. This work is aimed at researchers who already care about multi-objective optimization and distributional robustness, especially those tracking sample-complexity improvements in nonconvex regimes. A reader looking for new problem formulations or faster MGDA variants would find concrete value here. It deserves a serious referee because the formulation is cleanly stated and the complexity claim is ambitious enough to warrant verification of the proofs rather than a desk rejection.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Distributionally Robust Multi-Objective Optimization (DR-MOO), which extends standard MOO to account for distributional shifts by minimizing each objective under its worst-case distribution. It proposes Pareto-type solution concepts and develops two multi-gradient descent algorithms (MGDA): a double-loop MGDA that estimates dual variables in an inner loop with total sample complexity O(ε^{-12}), and a single-loop double-clip MGDA that uses gradient clipping to handle generalized smoothness and biased estimates from single sampling, achieving O(ε^{-4}) sample complexity for ε-Pareto-stationary points in nonconvex settings without boundedness assumptions on objectives or gradients. The claims are supported by theoretical analysis leveraging Lagrangian dual reformulation and numerical experiments comparing to state-of-the-art MGDA baselines.

Significance. If the provable guarantees and improved complexities hold, particularly the single-loop algorithm's O(ε^{-4}) rate without requiring boundedness, this would represent a meaningful advance in robust multi-objective optimization for machine learning applications involving distributional uncertainty. The approach of using gradient clipping to enable single-loop optimization while maintaining convergence in nonconvex cases could influence algorithm design in related areas like robust learning and multi-task optimization. The lack of boundedness assumptions broadens applicability compared to prior work.

major comments (2)

[Analysis of the single-loop double-clip MGDA] The central improvement to O(ε^{-4}) sample complexity relies on gradient clipping controlling both generalized smoothness and the bias arising from single-sampling the Lagrangian dual variables instead of using a double loop. However, in the nonconvex setting without any boundedness assumptions, the dual variables may be unbounded, and it is unclear whether clipping the per-objective gradients suffices to bound the estimation error in the common descent direction or the resulting stationarity measure. Please provide the specific lemma or inequality (e.g., bounding the bias term) that establishes this control, as this is load-bearing for the complexity claim.
[Lagrangian dual reformulation] The existence of worst-case distributions for each objective is assumed in the Lagrangian dual reformulation, but without boundedness on objectives or gradients, the dual variables could diverge. The paper should clarify how the dual reformulation remains well-defined and how the estimation proceeds in the single-loop setting under these conditions, including any implicit regularity from generalized smoothness.

minor comments (2)

[Abstract] The abstract mentions 'provable guarantees' but does not specify the exact definition of an ε-Pareto-stationary point; including a brief definition would improve clarity for readers.
[Experiments] The experiments demonstrate competitiveness with baselines, but additional details on how distributional shifts were simulated in the test cases would help assess the practical robustness claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential contributions of our work on Distributionally Robust Multi-Objective Optimization. The comments provide valuable opportunities to strengthen the presentation of the technical results. We address each major comment below.

read point-by-point responses

Referee: [Analysis of the single-loop double-clip MGDA] The central improvement to O(ε^{-4}) sample complexity relies on gradient clipping controlling both generalized smoothness and the bias arising from single-sampling the Lagrangian dual variables instead of using a double loop. However, in the nonconvex setting without any boundedness assumptions, the dual variables may be unbounded, and it is unclear whether clipping the per-objective gradients suffices to bound the estimation error in the common descent direction or the resulting stationarity measure. Please provide the specific lemma or inequality (e.g., bounding the bias term) that establishes this control, as this is load-bearing for the complexity claim.

Authors: We thank the referee for this precise observation on the load-bearing technical step. The bias control is established in Lemma 4.2, which shows that under generalized smoothness (Assumption 3.1) the estimation error between the true and single-sampled multi-gradient satisfies ||G(λ) - Ĝ(λ)|| ≤ 2τ + L·τ, where τ is the clipping threshold and L is the generalized smoothness constant; the bound holds independently of ||λ|| because clipping is applied to each per-objective gradient before the convex combination with λ. This error is then propagated to the stationarity measure in the proof of Theorem 4.3 via a standard descent lemma that absorbs the bias into the O(ε) term. We will insert an explicit forward reference to Lemma 4.2 immediately after the algorithm description in Section 4.2. revision: partial
Referee: [Lagrangian dual reformulation] The existence of worst-case distributions for each objective is assumed in the Lagrangian dual reformulation, but without boundedness on objectives or gradients, the dual variables could diverge. The paper should clarify how the dual reformulation remains well-defined and how the estimation proceeds in the single-loop setting under these conditions, including any implicit regularity from generalized smoothness.

Authors: We appreciate the referee highlighting the need for explicit clarification on well-definedness. The existence of the worst-case distributions follows from compactness of the ambiguity sets (Assumption 2.1) together with continuity of the objectives; the Lagrangian dual is therefore well-defined for any finite λ even when gradients are unbounded. Generalized smoothness supplies the local Lipschitz control needed for the gradient estimates to remain meaningful. In the single-loop algorithm the clipping step further regularizes the updates, preventing divergence in both theory and practice. We will add a short remark in Section 3.1 (right after the dual reformulation) that explicitly states these points and references the relevant assumptions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a new DR-MOO formulation via worst-case distributions, defines Pareto-type stationarity concepts, and constructs MGDA algorithms (double-loop then single-loop double-clip) whose sample complexities are derived from convergence analysis of the Lagrangian dual and gradient clipping. No steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the O(ε^{-4}) bound is presented as an outcome of the clipping technique handling generalized smoothness and bias without double sampling, with the analysis claimed to hold under the stated nonconvex setting and absence of boundedness assumptions. The derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the dual reformulation and worst-case distribution assumptions are invoked but not expanded.

pith-pipeline@v0.9.0 · 5494 in / 1090 out tokens · 18676 ms · 2026-05-08T14:56:42.249702+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

74 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Mathematica-Revue d'analyse num

On Fritz John type optimality criterion in multi-objective optimization , author=. Mathematica-Revue d'analyse num
[2]

Journal of the Operational Research Society , volume=

Nonlinear programming , author=. Journal of the Operational Research Society , volume=. 1997 , publisher=

1997
[3]

Large-Scale Methods for Distributionally Robust Optimization , volume =

Levy, Daniel and Carmon, Yair and Duchi, John C and Sidford, Aaron , booktitle =. Large-Scale Methods for Distributionally Robust Optimization , volume =
[4]

Advances in Neural Information Processing Systems , volume=

Non-convex distributionally robust optimization: Non-asymptotic analysis , author=. Advances in Neural Information Processing Systems , volume=
[5]

arXiv preprint arXiv:2405.19440 , year=

MGDA converges under generalized smoothness, provably , author=. arXiv preprint arXiv:2405.19440 , year=

work page arXiv
[6]

The Thirteenth International Conference on Learning Representations , year=

Revisiting large-scale non-convex distributionally robust optimization , author=. The Thirteenth International Conference on Learning Representations , year=
[7]

SIAM Journal on Optimization , volume =

Ghadimi, Saeed and Lan, Guanghui , title =. SIAM Journal on Optimization , volume =
[8]

Advances in Neural Information Processing Systems , volume=

Convex and non-convex optimization under generalized smoothness , author=. Advances in Neural Information Processing Systems , volume=
[9]

Operations Research , volume=

Wasserstein distributionally robust optimization and variation regularization , author=. Operations Research , volume=. 2024 , publisher=

2024
[10]

2025 , eprint=

Nested Stochastic Algorithm for Generalized Sinkhorn distance-Regularized Distributionally Robust Optimization , author=. 2025 , eprint=

2025
[11]

Unifying distribution- ally robust optimization via optimal transport theory,

Unifying distributionally robust optimization via optimal transport theory , author=. arXiv preprint arXiv:2308.05414 , year=

work page arXiv
[12]

Mathematics of Operations Research , volume=

Distributionally robust stochastic optimization with Wasserstein distance , author=. Mathematics of Operations Research , volume=. 2023 , publisher=

2023
[13]

Mathematical Programming , volume=

Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations , author=. Mathematical Programming , volume=. 2018 , publisher=

2018
[14]

arXiv preprint arXiv:2210.05740 , year=

Stochastic constrained dro with a complexity independent of sample size , author=. arXiv preprint arXiv:2210.05740 , year=

work page arXiv
[15]

Available at Optimization Online , volume=

Kullback-Leibler divergence constrained distributionally robust optimization , author=. Available at Optimization Online , volume=
[16]

Advances in Neural Information Processing Systems , volume=

Distributionally Robust Bayesian Optimization with -divergences , author=. Advances in Neural Information Processing Systems , volume=
[17]

Mathematics of operations research , volume=

Robust convex optimization , author=. Mathematics of operations research , volume=. 1998 , publisher=

1998
[18]

Operations research , volume=

Distributionally robust optimization under moment uncertainty with application to data-driven problems , author=. Operations research , volume=. 2010 , publisher=

2010
[19]

Advances in neural information processing systems , volume=

Stochastic gradient methods for distributionally robust optimization with f-divergences , author=. Advances in neural information processing systems , volume=
[20]

Advances in Neural Information Processing Systems , volume=

Direction-oriented multi-objective learning: Simple and provable stochastic algorithms , author=. Advances in Neural Information Processing Systems , volume=
[21]

The eleventh international conference on learning representations , year=

Mitigating gradient bias in multi-objective learning: A provably convergent approach , author=. The eleventh international conference on learning representations , year=
[22]

Advances in Neural Information Processing Systems , volume=

Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance , author=. Advances in Neural Information Processing Systems , volume=
[23]

Technical report, Department of Integrated Systems Engineering, The Ohio State University, Columbus, Ohio , year=

Phi-divergence constrained ambiguous stochastic programs for data-driven optimization , author=. Technical report, Department of Integrated Systems Engineering, The Ohio State University, Columbus, Ohio , year=
[24]

Comptes Rendus Mathematique , volume=

Multiple-gradient descent algorithm (MGDA) for multiobjective optimization , author=. Comptes Rendus Mathematique , volume=. 2012 , publisher=

2012
[25]

Annals of Operations Research , volume=

The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning , author=. Annals of Operations Research , volume=. 2024 , publisher=

2024
[26]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Variance reduction can improve trade-off in multi-objective learning , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024
[27]

Why gradient clipping accelerates training: A theoretical justification for adaptivity

Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. arXiv preprint arXiv:1905.11881 , year=

work page arXiv 1905
[28]

ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Variance-reduced clipping for non-convex optimization , author=. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2025 , organization=

2025
[29]

Mathematical Programming , volume=

Lower bounds for non-convex stochastic optimization , author=. Mathematical Programming , volume=. 2023 , publisher=

2023
[30]

Advances in Neural Information Processing Systems , volume=

Conflict-averse gradient descent for multi-task learning , author=. Advances in Neural Information Processing Systems , volume=
[31]

International conference on machine learning , pages=

Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks , author=. International conference on machine learning , pages=. 2018 , organization=

2018
[32]

Neural Networks , pages=

Dual-balancing for multi-task learning , author=. Neural Networks , pages=. 2025 , publisher=

2025
[33]

2021 , isbn =

Zheng, Yong and Wang, David (Xuejun) , title =. 2021 , isbn =

2021
[34]

Patterns , volume=

Computer-aided multi-objective optimization in small molecule discovery , author=. Patterns , volume=. 2023 , publisher=

2023
[35]

Discrete Applied Mathematics , volume=

Inverse radiation therapy planning—a multiple objective optimization approach , author=. Discrete Applied Mathematics , volume=. 2002 , publisher=

2002
[36]

2010 IEEE international conference on robotics and automation , pages=

Optimal trajectory generation for dynamic street scenarios in a frenet frame , author=. 2010 IEEE international conference on robotics and automation , pages=. 2010 , organization=

2010
[37]

, author=

A practical guide to multi-objective reinforcement learning and planning: CF Hayes et al. , author=. Autonomous Agents and Multi-Agent Systems , volume=. 2022 , publisher=

2022
[38]

Advances in neural information processing systems , volume=

Multi-task learning as multi-objective optimization , author=. Advances in neural information processing systems , volume=
[39]

Advances in neural information processing systems , volume=

Pareto multi-task learning , author=. Advances in neural information processing systems , volume=
[40]

Certifying

Certifying some distributional robustness with principled adversarial training , author=. arXiv preprint arXiv:1710.10571 , year=

work page arXiv
[41]

Optimization Methods and Software , volume=

Weakly-convex--concave min--max optimization: provable algorithms and applications in machine learning , author=. Optimization Methods and Software , volume=. 2022 , publisher=

2022
[42]

The Annals of Statistics , volume=

Learning models with uniform performance via distributionally robust optimization , author=. The Annals of Statistics , volume=. 2021 , publisher=

2021
[43]

SIAM Journal on Optimization , volume=

Distributionally robust stochastic programming , author=. SIAM Journal on Optimization , volume=. 2017 , publisher=

2017
[44]

Operations Research , volume=

Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization , author=. Operations Research , volume=. 2019 , publisher=

2019
[45]

Advances in Neural Information Processing Systems , volume=

Outlier-robust distributionally robust optimization via unbalanced optimal transport , author=. Advances in Neural Information Processing Systems , volume=
[46]

International conference on learning representations , year=

Towards impartial multi-task learning , author=. International conference on learning representations , year=
[47]

Advances in neural information processing systems , volume=

Gradient surgery for multi-task learning , author=. Advances in neural information processing systems , volume=
[48]

Advances in Neural Information Processing Systems , volume=

Just pick a sign: Optimizing deep multitask models with gradient sign dropout , author=. Advances in Neural Information Processing Systems , volume=
[49]

arXiv preprint arXiv:2103.02631 , year=

Rotograd: Gradient homogenization in multitask learning , author=. arXiv preprint arXiv:2103.02631 , year=

work page arXiv
[50]

2007 , publisher=

UCI machine learning repository , author=. 2007 , publisher=

2007
[51]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[52]

, journal=

LeCun, Yann and Cortes, Corinna and Burges, Christopher J.C. , journal=
[53]

Proceedings of International Conference on Computer Vision (ICCV) , month =

Deep Learning Face Attributes in the Wild , author =. Proceedings of International Conference on Computer Vision (ICCV) , month =
[54]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

2009
[55]

39th International Conference on Machine Learning , year=

Multi-task learning as a bargaining game , author=. 39th International Conference on Machine Learning , year=
[56]

Advances in Neural Information Processing Systems , volume=

Famo: Fast adaptive multitask optimization , author=. Advances in Neural Information Processing Systems , volume=
[57]

Explaining and Harnessing Adversarial Examples

Explaining and harnessing adversarial examples , author=. arXiv preprint arXiv:1412.6572 , year=

work page internal anchor Pith review arXiv
[58]

Operations Research , year=

Sinkhorn distributionally robust optimization , author=. Operations Research , year=
[59]

Advances in Neural Information Processing Systems , volume=

Distributionally robust optimization via ball oracle acceleration , author=. Advances in Neural Information Processing Systems , volume=
[60]

Proceedings of the 41st International Conference on Machine Learning , pages =

Efficient Algorithms for Empirical Group Distributionally Robust Optimization and Beyond , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

2024
[61]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Large-scale non-convex stochastic constrained distributionally robust optimization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[62]

Proceedings of the 24th international conference on Machine learning , pages=

Robust multi-task learning with t-processes , author=. Proceedings of the 24th international conference on Machine learning , pages=
[63]

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

Robust multi-task feature learning , author=. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=
[64]

The Annals of Statistics , volume=

Adaptive and robust multi-task learning , author=. The Annals of Statistics , volume=. 2023 , publisher=

2023
[65]

Proceedings of the 41st International Conference on Machine Learning , pages =

Robust multi-task learning with excess risks , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

2024
[66]

Advances in Neural Information Processing Systems , volume=

Adversarially robust multi-task representation learning , author=. Advances in Neural Information Processing Systems , volume=
[67]

CoRR , volume =

Michael Crawshaw , title =. CoRR , volume =. 2020 , eprinttype =

2020
[68]

Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond

Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond , author=. arXiv preprint arXiv:2501.10945 , year=

work page arXiv
[69]

OPT 2023: Optimization for Machine Learning , year =

Why adam outperforms gradient descent on language models: A heavy-tailed class imbalance problem , author=. OPT 2023: Optimization for Machine Learning , year =

2023
[70]

Journal of Machine Learning Research , year =

Baijiong Lin and Yu Zhang , title =. Journal of Machine Learning Research , year =
[71]

International Conference on Machine Learning , pages=

Revisiting gradient clipping: Stochastic bias and tight convergence guarantees , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[72]

Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =
[73]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review arXiv
[74]

2006 , publisher=

Pattern recognition and machine learning , author=. 2006 , publisher=

2006