Recognition: 2 theorem links
· Lean Theorem · Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3
The pith
Converging to common minima across data sources during pretraining improves downstream generalization even at identical loss values.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the geometric closeness of task-specific minima is intrinsically linked to downstream generalization; standard pretraining leaves these minima distant, while maximizing gradient similarity produces a common minimizer and thereby better generalization without altering the achieved pretraining loss.
What carries the argument
The Nexus optimizer, which modifies the update step to maximize similarity between gradients computed on different data subsets and thereby pulls their individual minima closer together.
If this is right
- Pretraining loss ceases to be a sufficient proxy for model quality once optimizer-induced biases are considered.
- The same computational budget can produce stronger reasoning capabilities by changing only the optimization rule.
- The benefit appears across model scales and data mixtures, with the largest relative gains on out-of-distribution reasoning tasks.
- Gradient alignment during pretraining offers a controllable lever for generalization that does not require additional data or model size.
Where Pith is reading between the lines
- If closeness of minima is the operative mechanism, then similar gradient-alignment techniques could be applied to other multi-task or multi-domain training regimes beyond language modeling.
- The result suggests that future scaling laws may need to incorporate an explicit term for optimizer-induced geometric bias rather than loss alone.
- One testable extension is whether the same principle holds when the data mixture changes dynamically during training.
Load-bearing premise
That pulling task-specific minima geometrically closer is what causes the observed gains in downstream generalization rather than some other property of the optimization trajectory.
What would settle it
Measure the Euclidean distance between the minima found for separate data subsets after pretraining with and without Nexus; check whether the reduction in that distance predicts the size of the downstream accuracy improvement.
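The proposed check can be sketched concretely. A minimal sketch, assuming checkpoints are exported as dicts mapping layer names to flat weight lists; every name and number below is hypothetical, not a measurement from the paper:

```python
import math

def l2_distance(params_a, params_b):
    """Euclidean distance between two checkpoints given as {layer: flat weights}."""
    total = 0.0
    for name in params_a:
        total += sum((a - b) ** 2 for a, b in zip(params_a[name], params_b[name]))
    return math.sqrt(total)

def pearson(xs, ys):
    """Pearson correlation between distance reductions and downstream gains."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

# Hypothetical per-run measurements: reduction in inter-minima distance
# (AdamW minus Nexus) versus downstream accuracy improvement.
dist_reduction = [0.8, 1.1, 0.3, 1.5, 0.9]
acc_gain = [2.1, 3.0, 0.7, 4.2, 2.5]
r = pearson(dist_reduction, acc_gain)
```

In practice the distance would be taken between task-specific minima reached from the same pretrained checkpoint, and the correlation computed across runs or model scales.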
read the original abstract
Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: does the model converge to a common minimizer across all data sources, or merely a minimizer of the summed loss (both cases are illustrated in the paper's figures)? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, with various data mixtures and hyperparameter schedules, show that Nexus significantly boosts downstream performance despite achieving the same pretraining loss (see the paper's benchmark figure). Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard optimizers like AdamW converge to points where task-specific minima (across data sources such as language, math, and code) are geometrically distant, and that this distance harms downstream generalization even at the same pretraining loss. It introduces the Nexus optimizer, which augments the update rule to maximize gradient similarity across tasks during pretraining, thereby encouraging convergence to common minima. Experiments on models from 130M to 3B parameters across data mixtures report identical pretraining loss but improved downstream metrics, including a 0.012 reduction in out-of-distribution loss and up to 15% accuracy gains on tasks like GSM8k for the 3B model.
Significance. If the causal link between gradient-similarity maximization, minima closeness, and generalization holds after proper isolation, the result would be significant: it would demonstrate that pretraining loss is an incomplete proxy for model quality and that optimizer implicit biases can be engineered to improve downstream performance without additional data or compute. The scale of experiments (multiple model sizes and mixtures) and the challenge to loss-as-proxy evaluation add value, though the current evidence remains correlational rather than mechanistic.
major comments (3)
- [Abstract / experimental results] Abstract and experimental results section: the central claim requires that Nexus produces measurably closer task-specific minima than AdamW at identical pretraining loss, yet no direct quantification of minima distance (e.g., parameter-space distance after task-specific fine-tuning from the shared checkpoint, or loss-landscape interpolation between task minima) is reported; only final loss and downstream accuracy are shown.
- [Nexus optimizer definition] § on Nexus optimizer definition: the auxiliary term that maximizes gradient similarity is introduced without an ablation that applies an equivalent-magnitude auxiliary gradient (without the explicit similarity objective) to isolate whether the performance delta arises from similarity maximization rather than incidental changes in effective step size, noise scale, or curvature.
- [Experimental protocol] Experimental protocol: no details are supplied on the number of independent runs, random seeds, statistical tests (e.g., t-tests or confidence intervals) for the reported deltas such as the 0.012 OOD loss reduction or 15% GSM8k gain, nor on how gradient similarity was computed and maximized in practice across the data mixtures.
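The requested statistical reporting is standard and cheap to compute. A minimal sketch of a two-sided 95% t-confidence interval over seed runs, with a hard-coded critical value for five runs; the deltas below are hypothetical placeholders, not the paper's numbers:

```python
import math
from statistics import mean, stdev

def ci95(samples):
    """Two-sided 95% t-confidence interval for the mean of a small sample."""
    # t critical value for df = 4 (five runs) at the 95% level, hard-coded
    # for this sketch; a stats library would supply it for arbitrary df.
    T_CRIT = {4: 2.776}
    n = len(samples)
    m = mean(samples)
    half = T_CRIT[n - 1] * stdev(samples) / math.sqrt(n)
    return m - half, m + half

# Hypothetical GSM8k accuracy deltas (Nexus minus AdamW) over 5 seeds.
deltas = [14.2, 15.8, 13.9, 16.1, 15.0]
lo, hi = ci95(deltas)
# The reported gain is supported only if the interval excludes zero.
```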
minor comments (2)
- [Figures] Figure captions and axis labels in the benchmark and illustration figures should explicitly state the number of runs and include error bars where applicable.
- [Notation] The notation for the gradient-similarity term should be defined once in the main text rather than only in the appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our results.
read point-by-point responses
-
Referee: [Abstract / experimental results] Abstract and experimental results section: the central claim requires that Nexus produces measurably closer task-specific minima than AdamW at identical pretraining loss, yet no direct quantification of minima distance (e.g., parameter-space distance after task-specific fine-tuning from the shared checkpoint, or loss-landscape interpolation between task minima) is reported; only final loss and downstream accuracy are shown.
Authors: We acknowledge that direct quantification of minima distances would strengthen the geometric interpretation. However, in high-dimensional parameter spaces, Euclidean distances are often uninformative due to the curse of dimensionality and the non-convex nature of the loss landscape. We instead use downstream generalization performance at matched pretraining loss as the primary, practically relevant evidence for the hypothesis. In the revised manuscript we will add loss-landscape interpolation plots between task-specific fine-tuned models (starting from the shared pretrained checkpoint) to provide additional supporting analysis. This is a partial revision, as exhaustive distance metrics across all tasks would require substantial extra compute. revision: partial
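The promised loss-landscape interpolation reduces to evaluating the loss along the straight line between two minima. A toy sketch under a stand-in quadratic loss (not the paper's model or data; all values hypothetical); a loss barrier along the path would indicate distant, disconnected minima, while a flat path suggests a shared basin:

```python
def interpolate(theta_a, theta_b, alpha):
    """Point at fraction alpha along the straight line from theta_a to theta_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def summed_loss(theta):
    # Hypothetical stand-in for the pretraining loss summed over sources:
    # two quadratic task losses with minima at (1, 0) and (0, 1).
    x, y = theta
    return (x - 1) ** 2 + y ** 2 + x ** 2 + (y - 1) ** 2

theta_math = [1.0, 0.0]   # hypothetical minimum fine-tuned on math
theta_code = [0.0, 1.0]   # hypothetical minimum fine-tuned on code
path = [summed_loss(interpolate(theta_math, theta_code, a / 10)) for a in range(11)]
barrier = max(path) - max(path[0], path[-1])  # height of any bump above endpoints
```

In a real run, theta would be flattened model state dicts from the shared pretrained checkpoint, and the loss would be evaluated on held-out batches at each interpolation point.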
-
Referee: [Nexus optimizer definition] § on Nexus optimizer definition: the auxiliary term that maximizes gradient similarity is introduced without an ablation that applies an equivalent-magnitude auxiliary gradient (without the explicit similarity objective) to isolate whether the performance delta arises from similarity maximization rather than incidental changes in effective step size, noise scale, or curvature.
Authors: This is a fair criticism and a useful control experiment. We will add an ablation in the revised version that applies an auxiliary gradient of matched magnitude but with randomized directions (no explicit similarity objective). This will help isolate whether the observed gains arise specifically from gradient alignment rather than secondary effects on step size or curvature. We have already run preliminary versions of this control and will report the full results. revision: yes
-
Referee: [Experimental protocol] Experimental protocol: no details are supplied on the number of independent runs, random seeds, statistical tests (e.g., t-tests or confidence intervals) for the reported deltas such as the 0.012 OOD loss reduction or 15% GSM8k gain, nor on how gradient similarity was computed and maximized in practice across the data mixtures.
Authors: We apologize for these omissions. The revised manuscript will include: (i) results averaged over 5 independent runs using seeds 42, 123, 456, 789 and 1011, with mean and standard deviation reported; (ii) two-sided t-tests and 95% confidence intervals for the key deltas (0.012 OOD loss and GSM8k accuracy); (iii) explicit implementation details: gradient similarity is computed as the average cosine similarity between per-source gradients (language, math, code) obtained by separate forward-backward passes on source-specific batches within each update; the auxiliary term is added to the loss as 0.1 * (1 - avg_cosine) and optimized jointly with the main objective. revision: yes
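The stated implementation detail can be sketched directly: average pairwise cosine similarity between per-source gradients, added to the loss as 0.1 * (1 - avg_cosine). A minimal pure-Python sketch with hypothetical toy gradients, not the paper's actual code; the real term would be computed on per-source batches and differentiated through:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_penalty(grads, weight=0.1):
    """Auxiliary term as described above: weight * (1 - average pairwise
    cosine similarity) over per-source gradients (e.g., language, math, code)."""
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    avg_cos = sum(cosine(grads[i], grads[j]) for i, j in pairs) / len(pairs)
    return weight * (1.0 - avg_cos)

# Hypothetical per-source gradients from separate forward-backward passes.
g_lang = [1.0, 0.5]
g_math = [0.8, 0.6]
g_code = [0.9, 0.4]
penalty = similarity_penalty([g_lang, g_math, g_code])
# Aligned gradients give avg_cos near 1, so the penalty approaches 0.
```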
Circularity Check
No significant circularity; empirical results rest on independent optimizer design and measurements.
full rationale
The paper advances a hypothesis that geometric closeness of task-specific minima improves downstream generalization, introduces the Nexus optimizer to maximize gradient similarity as an independent design choice, and reports experimental outcomes (identical pretraining loss but better downstream metrics) across model scales. No load-bearing step reduces a claimed result to a fitted parameter, self-referential definition, or self-citation chain by construction. The optimizer objective is explicitly motivated rather than derived from the target generalization metric, and the reported deltas are direct empirical comparisons rather than predictions forced by the inputs. This is the normal case of a non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: closeness of task-specific minima is intrinsically linked to downstream generalization.
invented entities (1)
- Nexus optimizer · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"Nexus ... approximates the gradient of gradient similarity ∇CosSim(∇L_i, ∇L_j) ... J_2nd(θ) = γ Σ ‖∇L_i(θ)‖² − (γ²(K−1)/(4K)) Σ CosSim(∇L_i, ∇L_j)"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
"Theorem 2.2 ... E_{T∼P}[L_T(θ*_train, B)] = C_train + (a/K)·σ²_B (quadratic basins, variance of minima)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [2] Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, et al. NVIDIA Nemotron Nano 2: An accurate and efficient hybrid Mamba-Transformer reasoning model. arXiv preprint arXiv:2508.14444, 2025.
- [3] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, pages 15849–15854, 2019.
- [4] Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, and Jun Zhu. Understanding pre-training and fine-tuning from loss landscape perspectives. arXiv preprint arXiv:2505.17646, 2025.
- [5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [6] Jeremy Cohen, Alex Damian, Ameet Talwalkar, J. Zico Kolter, and Jason D. Lee. Understanding optimization in deep learning with central flows. In The Thirteenth International Conference on Learning Representations, 2025.
- [7] Alex Damian, Tengyu Ma, and Jason D. Lee. Label noise SGD provably prefers flat global minimizers. Advances in Neural Information Processing Systems, pages 27449–27461, 2021.
- [8] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.
- [9] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [10] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [11] Mark Chen et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [12] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
- [13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- [14] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [15] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
- [16] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 2013.
- [17] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.
- [18] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [19] Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute. arXiv preprint arXiv:2509.14786, 2025.
- [20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [21] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914, 2021.
- [22] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018.
- [23] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [24] Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. Same pre-training loss, better downstream: Implicit bias matters for language models. In International Conference on Machine Learning, pages 22188–22214. PMLR, 2023.
- [25] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [26] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
- [27] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
- [28] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
- [29] Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276, 2025.
- [30] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Ji..., 2025.
- [31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
- [32] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
- [33] Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857, 2025.
- [34] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
- [35] Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440, 2025.
- [36] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 2018.
- [37] Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. Overtrained language models are harder to fine-tune. arXiv preprint arXiv:2503.19206, 2025.
- [38] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [39] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- [40] ByteDance Seed Team. Seed-OSS open-source models. https://github.com/ByteDance-Seed/seed-oss, 2025.
- [41] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
- [42] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [43] Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y. F. Tan. Muon outperforms Adam in tail-end associative memory learning. arXiv preprint arXiv:2509.26030, 2025.
- [44] Kaiyue Wen, Tengyu Ma, and Zhiyuan Li. How does sharpness-aware minimization minimize sharpness? arXiv preprint arXiv:2211.05729, 2022.
- [45] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025.
- [46] Kaiyue Wen, Zhiyuan Li, Jason S. Wang, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In The Thirteenth International Conference on Learning Representations, 2025.
- [47] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [48] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- [49] Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, and Zeming Wei. On the duality between sharpness-aware minimization and adversarial training. arXiv preprint arXiv:2402.15152, 2024.
discussion (0)