pith. machine review for the scientific record.

arxiv: 2604.09258 · v1 · submitted 2026-04-10 · 💻 cs.LG

Recognition: 2 Lean theorem links

Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords pretraining · downstream generalization · minima closeness · gradient similarity · Nexus optimizer · large language models · out-of-distribution performance

The pith

Converging to common minima across data sources during pretraining improves downstream generalization even at identical loss values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether pretraining finds one shared minimizer for all data domains or simply a point that averages their losses separately. It posits that geometric closeness among those domain-specific minima drives better performance on unseen tasks. Standard optimizers such as AdamW typically leave the minima far apart. The proposed Nexus optimizer counters this by maximizing gradient similarity across data subsets throughout training. Experiments on models from 130M to 3B parameters show the same final pretraining loss yet materially lower out-of-distribution loss and gains of up to 15 percent on complex reasoning benchmarks.

Core claim

The central claim is that the geometric closeness of task-specific minima is intrinsically linked to downstream generalization; standard pretraining leaves these minima distant, while maximizing gradient similarity produces a common minimizer and thereby better generalization without altering the achieved pretraining loss.

What carries the argument

The Nexus optimizer, which modifies the update step to maximize similarity between gradients computed on different data subsets and thereby pulls their individual minima closer together.
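
The review does not reproduce the Nexus update rule itself, but the quantity it targets is easy to state: gradients of the same parameters computed on batches from different data sources, compared by cosine similarity. A minimal PyTorch sketch of that diagnostic, where `model`, `loss_fn`, and the per-source batches are assumed inputs rather than anything specified by the paper:

```python
import torch
import torch.nn.functional as F

def flatten_grads(model):
    """Concatenate all populated parameter gradients into a single vector."""
    return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])

def per_source_gradients(model, loss_fn, source_batches):
    """One flattened gradient per data source (e.g. language, math, code batches)."""
    grads = []
    for batch in source_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        grads.append(flatten_grads(model).detach().clone())
    return grads

def mean_pairwise_cosine(grads):
    """Average cosine similarity over all pairs of per-source gradients."""
    pairs = [F.cosine_similarity(grads[i], grads[j], dim=0)
             for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return torch.stack(pairs).mean()
```

Standard pretraining would simply average these per-source gradients into one update; the Nexus description above additionally puts pressure on raising the pairwise cosine.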

If this is right

  • Pretraining loss ceases to be a sufficient proxy for model quality once optimizer-induced biases are considered.
  • The same computational budget can produce stronger reasoning capabilities by changing only the optimization rule.
  • The benefit appears across model scales and data mixtures, with the largest relative gains on out-of-distribution reasoning tasks.
  • Gradient alignment during pretraining offers a controllable lever for generalization that does not require additional data or model size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If closeness of minima is the operative mechanism, then similar gradient-alignment techniques could be applied to other multi-task or multi-domain training regimes beyond language modeling.
  • The result suggests that future scaling laws may need to incorporate an explicit term for optimizer-induced geometric bias rather than loss alone.
  • One testable extension is whether the same principle holds when the data mixture changes dynamically during training.

Load-bearing premise

That pulling task-specific minima geometrically closer is what causes the observed gains in downstream generalization rather than some other property of the optimization trajectory.

What would settle it

Measure the Euclidean distance between the minima found for separate data subsets after pretraining with and without Nexus; check whether the reduction in that distance predicts the size of the downstream accuracy improvement.
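
A hedged sketch of how that check could be run, assuming the per-source minima are materialized as checkpoints obtained by fine-tuning the shared pretrained model on each data subset; the names and file handling below are hypothetical:

```python
import numpy as np
import torch
from scipy import stats

def param_vector(state_dict):
    """Flatten a checkpoint's floating-point tensors into a single vector."""
    return torch.cat([v.reshape(-1).float() for v in state_dict.values()
                      if torch.is_floating_point(v)])

def minima_spread(checkpoint_paths):
    """Mean pairwise Euclidean distance between per-source minima (one checkpoint per source)."""
    vecs = [param_vector(torch.load(p, map_location="cpu")) for p in checkpoint_paths]
    dists = [torch.dist(vecs[i], vecs[j]).item()
             for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return float(np.mean(dists))

def closeness_predicts_gain(spread_reduction, accuracy_gain):
    """Correlate the reduction in spread (AdamW minus Nexus) with the downstream accuracy gain."""
    return stats.pearsonr(np.asarray(spread_reduction), np.asarray(accuracy_gain))
```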

read the original abstract

Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: does the model converge to a common minimizer across all data sources, or merely a minimizer of the summed loss? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures, and hyperparameter schedules show that Nexus significantly boosts downstream performance despite achieving the same pretraining loss. Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard optimizers like AdamW converge to points where task-specific minima (across data sources such as language, math, and code) are geometrically distant, and that this distance harms downstream generalization even at the same pretraining loss. It introduces the Nexus optimizer, which augments the update rule to maximize gradient similarity across tasks during pretraining, thereby encouraging convergence to common minima. Experiments on models from 130M to 3B parameters across data mixtures report identical pretraining loss but improved downstream metrics, including a 0.012 reduction in out-of-distribution loss and up to 15% accuracy gains on tasks like GSM8k for the 3B model.

Significance. If the causal link between gradient-similarity maximization, minima closeness, and generalization holds after proper isolation, the result would be significant: it would demonstrate that pretraining loss is an incomplete proxy for model quality and that optimizer implicit biases can be engineered to improve downstream performance without additional data or compute. The scale of experiments (multiple model sizes and mixtures) and the challenge to loss-as-proxy evaluation add value, though the current evidence remains correlational rather than mechanistic.

major comments (3)
  1. [Abstract / experimental results] Abstract and experimental results section: the central claim requires that Nexus produces measurably closer task-specific minima than AdamW at identical pretraining loss, yet no direct quantification of minima distance (e.g., parameter-space distance after task-specific fine-tuning from the shared checkpoint, or loss-landscape interpolation between task minima) is reported; only final loss and downstream accuracy are shown.
  2. [Nexus optimizer definition] § on Nexus optimizer definition: the auxiliary term that maximizes gradient similarity is introduced without an ablation that applies an equivalent-magnitude auxiliary gradient (without the explicit similarity objective) to isolate whether the performance delta arises from similarity maximization rather than incidental changes in effective step size, noise scale, or curvature.
  3. [Experimental protocol] Experimental protocol: no details are supplied on the number of independent runs, random seeds, statistical tests (e.g., t-tests or confidence intervals) for the reported deltas such as the 0.012 OOD loss reduction or 15% GSM8k gain, nor on how gradient similarity was computed and maximized in practice across the data mixtures.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the benchmark and illustration figures should explicitly state the number of runs and error bars if any.
  2. [Notation] The notation for the gradient-similarity term should be defined once in the main text rather than only in the appendix.
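
Major comment 3 asks for a protocol that is short enough to sketch: repeat each configuration over several independent seeds and report the delta with a two-sided test and a confidence interval. A minimal version, with the per-seed metric arrays assumed rather than taken from the paper:

```python
import numpy as np
from scipy import stats

def compare_runs(baseline_scores, nexus_scores, alpha=0.05):
    """Two-sided Welch t-test and CI for the mean delta between per-seed metrics (e.g. GSM8k accuracy)."""
    baseline = np.asarray(baseline_scores, dtype=float)
    nexus = np.asarray(nexus_scores, dtype=float)
    t_stat, p_value = stats.ttest_ind(nexus, baseline, equal_var=False)
    delta = nexus.mean() - baseline.mean()
    # Confidence interval for the difference of means; conservative degrees of freedom.
    se = np.sqrt(nexus.var(ddof=1) / len(nexus) + baseline.var(ddof=1) / len(baseline))
    dof = min(len(nexus), len(baseline)) - 1
    half_width = stats.t.ppf(1 - alpha / 2, dof) * se
    return {"delta": delta, "ci": (delta - half_width, delta + half_width),
            "t": t_stat, "p": p_value}
```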

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract / experimental results] Abstract and experimental results section: the central claim requires that Nexus produces measurably closer task-specific minima than AdamW at identical pretraining loss, yet no direct quantification of minima distance (e.g., parameter-space distance after task-specific fine-tuning from the shared checkpoint, or loss-landscape interpolation between task minima) is reported; only final loss and downstream accuracy are shown.

    Authors: We acknowledge that direct quantification of minima distances would strengthen the geometric interpretation. However, in high-dimensional parameter spaces, Euclidean distances are often uninformative due to the curse of dimensionality and the non-convex nature of the loss landscape. We instead use downstream generalization performance at matched pretraining loss as the primary, practically relevant evidence for the hypothesis. In the revised manuscript we will add loss-landscape interpolation plots between task-specific fine-tuned models (starting from the shared pretrained checkpoint) to provide additional supporting analysis. This is a partial revision, as exhaustive distance metrics across all tasks would require substantial extra compute. revision: partial

  2. Referee: [Nexus optimizer definition] § on Nexus optimizer definition: the auxiliary term that maximizes gradient similarity is introduced without an ablation that applies an equivalent-magnitude auxiliary gradient (without the explicit similarity objective) to isolate whether the performance delta arises from similarity maximization rather than incidental changes in effective step size, noise scale, or curvature.

    Authors: This is a fair criticism and a useful control experiment. We will add an ablation in the revised version that applies an auxiliary gradient of matched magnitude but with randomized directions (no explicit similarity objective). This will help isolate whether the observed gains arise specifically from gradient alignment rather than secondary effects on step size or curvature. We have already run preliminary versions of this control and will report the full results. revision: yes

  3. Referee: [Experimental protocol] Experimental protocol: no details are supplied on the number of independent runs, random seeds, statistical tests (e.g., t-tests or confidence intervals) for the reported deltas such as the 0.012 OOD loss reduction or 15% GSM8k gain, nor on how gradient similarity was computed and maximized in practice across the data mixtures.

    Authors: We apologize for these omissions. The revised manuscript will include: (i) results averaged over 5 independent runs using seeds 42, 123, 456, 789 and 1011, with mean and standard deviation reported; (ii) two-sided t-tests and 95% confidence intervals for the key deltas (0.012 OOD loss and GSM8k accuracy); (iii) explicit implementation details: gradient similarity is computed as the average cosine similarity between per-source gradients (language, math, code) obtained by separate forward-backward passes on source-specific batches within each update; the auxiliary term is added to the loss as 0.1 * (1 - avg_cosine) and optimized jointly with the main objective. revision: yes
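
Taking the implementation details in this response literally, one update step would look roughly like the sketch below: per-source gradients from separate forward-backward passes, their average pairwise cosine, and an auxiliary term 0.1 * (1 - avg_cosine) added to the main objective. Treating the main objective as the mean of the per-source losses, and using `create_graph=True` so the cosine term is itself differentiable, are our assumptions, not statements from the paper; `loss_fn` and the batches are hypothetical.

```python
import torch
import torch.nn.functional as F

def nexus_style_step(model, optimizer, loss_fn, source_batches, aux_weight=0.1):
    """One update combining the mixture loss with the gradient-similarity auxiliary term."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Per-source gradients, kept in the autograd graph so the cosine term is differentiable.
    per_source = []
    for batch in source_batches:
        grads = torch.autograd.grad(loss_fn(model, batch), params, create_graph=True)
        per_source.append(torch.cat([g.reshape(-1) for g in grads]))

    pairs = [F.cosine_similarity(per_source[i], per_source[j], dim=0)
             for i in range(len(per_source)) for j in range(i + 1, len(per_source))]
    avg_cosine = torch.stack(pairs).mean()

    main_loss = torch.stack([loss_fn(model, b) for b in source_batches]).mean()
    total = main_loss + aux_weight * (1.0 - avg_cosine)

    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return main_loss.item(), avg_cosine.item()
```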
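
For the interpolation analysis promised in the response to point 1, a standard recipe is to evaluate a held-out loss along the straight line between two task-specific fine-tuned checkpoints that share the pretrained initialization; a pronounced barrier between the endpoints would indicate distant minima. A sketch under those assumptions, with `eval_loss` and the state dicts hypothetical:

```python
import copy
import torch

@torch.no_grad()
def interpolation_curve(model, state_a, state_b, eval_loss, num_points=11):
    """Held-out loss along (1 - t) * theta_a + t * theta_b between two task-specific minima."""
    losses = []
    for t in torch.linspace(0.0, 1.0, num_points):
        mixed = {k: torch.lerp(state_a[k].float(), state_b[k].float(), t) for k in state_a}
        probe = copy.deepcopy(model)
        probe.load_state_dict(mixed)
        losses.append(float(eval_loss(probe)))  # a bump in the middle suggests distant minima
    return losses
```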

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on independent optimizer design and measurements.

full rationale

The paper advances a hypothesis that geometric closeness of task-specific minima improves downstream generalization, introduces the Nexus optimizer to maximize gradient similarity as an independent design choice, and reports experimental outcomes (identical pretraining loss but better downstream metrics) across model scales. No load-bearing step reduces a claimed result to a fitted parameter, self-referential definition, or self-citation chain by construction. The optimizer objective is explicitly motivated rather than derived from the target generalization metric, and the reported deltas are direct empirical comparisons rather than predictions forced by the inputs. This is the normal case of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the unproven hypothesis that minima closeness drives generalization and on the empirical claim that gradient similarity maximization produces that closeness; no free parameters or invented physical entities are introduced in the abstract.

axioms (1)
  • domain assumption: Closeness of task-specific minima is intrinsically linked to downstream generalization
    Stated explicitly as the central hypothesis in the abstract.
invented entities (1)
  • Nexus optimizer (no independent evidence)
    purpose: Encourages closeness of task minima by maximizing gradient similarity
    New optimizer proposed to address the hypothesized geometric issue.

pith-pipeline@v0.9.0 · 5624 in / 1175 out tokens · 34982 ms · 2026-05-10T17:50:26.241619+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 31 canonical work pages · 17 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  2. [2]

    NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

    Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model. arXiv preprint arXiv:2508.14444, 2025

  3. [3]

    Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off

    Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, pages 15849–15854, 2019

  4. [4]

    Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives

    Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, and Jun Zhu. Understanding pre-training and fine-tuning from loss landscape perspectives. arXiv preprint arXiv:2505.17646, 2025

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  6. [6]

    Understanding optimization in deep learning with central flows

    Jeremy Cohen, Alex Damian, Ameet Talwalkar, J Zico Kolter, and Jason D Lee. Understanding optimization in deep learning with central flows. In The Thirteenth International Conference on Learning Representations, 2025

  7. [7]

    Label Noise SGD Provably Prefers Flat Global Minimizers

    Alex Damian, Tengyu Ma, and Jason D Lee. Label noise SGD provably prefers flat global minimizers. Advances in Neural Information Processing Systems, pages 27449–27461, 2021

  8. [8]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

  9. [9]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  10. [10]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  11. [11]

    Evaluating Large Language Models Trained on Code

    Mark Chen et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  12. [12]

    Sharpness-Aware Minimization for Efficiently Improving Generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020

  13. [13]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  14. [14]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

  15. [15]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

  16. [16]

    Accelerating stochastic gradient descent using predictive variance reduction

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 2013

  17. [17]

    Muon: An Optimizer for Hidden Layers in Neural Networks

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

  18. [18]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  19. [19]

    & Hashimoto, T

    Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute.arXiv preprint arXiv:2509.14786, 2025

  20. [20]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  21. [21]

    Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks

    Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pages 5905–5914, 2021

  22. [22]

    Visualizing the loss landscape of neural nets

    Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018

  23. [23]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  24. [24]

    Same pre-training loss, better downstream: Implicit bias matters for language models

    Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. Same pre-training loss, better downstream: Implicit bias matters for language models. In International Conference on Machine Learning, pages 22188–22214. PMLR, 2023

  25. [25]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  26. [26]

    Sgdr: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017

  27. [27]

    Gradient descent maximizes the margin of homogeneous neural networks

    Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019

  28. [28]

    In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

    Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning.arXiv preprint arXiv:1412.6614, 2014

  29. [29]

    Diffusion Language Models Are Super Data Learners

    Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276, 2025

  30. [30]

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Ji...

  31. [31]

    Automatic differentiation in pytorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017

  32. [32]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

  33. [33]

    Diffusion Beats Autoregressive in Data-Constrained Settings

    Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857, 2025

  34. [34]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  35. [35]

    Benchmarking Optimizers for Large Language Model Pretraining

    Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440, 2025

  36. [36]

    The Implicit Bias of Gradient Descent on Separable Data

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 2018

  37. [37]

    Overtrained Language Models Are Harder to Fine-Tune

    Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. Overtrained language models are harder to fine-tune. arXiv preprint arXiv:2503.19206, 2025

  38. [38]

    Dropout: A Simple Way to Prevent Neural Networks from Overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014

  39. [39]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  40. [40]

    Seed-OSS Open-Source Models

    ByteDance Seed Team. Seed-OSS open-source models. https://github.com/ByteDance-Seed/seed-oss, 2025

  41. [41]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  42. [42]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  43. [43]

    Muon Outperforms Adam in Tail-End Associative Memory Learning

    Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent YF Tan. Muon outperforms adam in tail-end associative memory learning. arXiv preprint arXiv:2509.26030, 2025

  44. [44]

    How Does Sharpness-Aware Minimization Minimize Sharpness?

    Kaiyue Wen, Tengyu Ma, and Zhiyuan Li. How does sharpness-aware minimization minimize sharpness? arXiv preprint arXiv:2211.05729, 2022

  45. [45]

    Fantastic Pretraining Optimizers and Where to Find Them

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025

  46. [46]

    Understanding warmup-stable-decay learning rates: A river valley loss landscape view

    Kaiyue Wen, Zhiyuan Li, Jason S Wang, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In The Thirteenth International Conference on Learning Representations, 2025

  47. [47]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  48. [48]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization.arXiv preprint arXiv:1611.03530, 2016

  49. [49]

    On the Duality Between Sharpness-Aware Minimization and Adversarial Training

    Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, and Zeming Wei. On the duality between sharpness-aware minimization and adversarial training. arXiv preprint arXiv:2402.15152, 2024

  50. [50]

    Internal anchor: assumption that the normalized gradient is L1-Lipschitz continuous (Eq. 61).

  51. [51]

    Internal anchor: assumption that the Jacobian of the normalized gradient is L2-Lipschitz continuous (Eq. 62), with the constants L1 and L2 derived from Assumptions 1-3.

  52. [52]

    Internal anchor: derivation of L1 via the Mean Value Theorem and the explicit Jacobian of the normalized gradient (Eq. 63).

  53. [53]

    Internal anchor: derivation of L2 by decomposing the Jacobian into a scalar term, a projection term, and the Hessian, using a product Lipschitz rule (Eq. 65); generalized directional sharpness.

  54. [54]

    Internal anchor: exact third-order Taylor expansion of the gradient at the m-th inner step (Eq. 97).

  55. [55]

    Internal anchor: bound on the Taylor remainder via the displacement magnitude (Eq. 98).

  56. [56]

    Internal anchor: decomposition of the true displacement into an ideal displacement plus an accumulated error term (Eq. 99).

  57. [57]

    Internal anchor: substitution into the quadratic term via multilinearity of the third-order tensor, with a bound on the residual cross-terms (Eqs. 100-101).

  58. [58]

    Internal anchor: derivation of the expected update direction from the explicit third-order component of the update (Eqs. 103-104).

  59. [59]

    Internal anchor: bound on the total third-order residual (Eqs. 108-109).