pith. machine review for the scientific record.

arxiv: 2603.28743 · v2 · submitted 2026-03-30 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Liliang Ren, Weizhu Chen, Yang Liu, Yelong Shen

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 21:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords hypersphere parameterization · learning rate transfer · scaling laws · Muon optimizer · MoE stability · compute efficiency · LLM training

The pith

HyperP transfers one base learning rate across all scales under the Frobenius-sphere constraint, delivering 1.58× compute efficiency at 6×10^21 FLOPs while keeping all instability indicators bounded.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HyperP, a hypersphere parameterization framework that constrains weight matrices to a fixed-norm Frobenius sphere and pairs it with the Muon optimizer. It proves weight decay acts as a first-order no-op on this sphere and shows that the optimal learning rate still follows the data-scaling power law with exponent 0.32. A single base learning rate tuned at the smallest scale therefore transfers across width, depth, token count, and MoE granularity without retuning. At 6×10^21 FLOPs this produces 1.58× compute efficiency over a strong Muon baseline, with all monitored instability indicators remaining bounded and non-increasing. Depth-μP is still required, and the authors also introduce SqrtGate to preserve output RMS across MoE granularities.
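The mechanics of the constraint are simple to state: after every optimizer step, each constrained weight matrix is mapped back onto a sphere of fixed Frobenius norm. The paper's exact retraction is not reproduced here; the following is a minimal editorial sketch, assuming the norm is frozen at its initialization value.

```python
import math

def frobenius_norm(W):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in W for x in row))

def project_to_frobenius_sphere(W, radius):
    """Rescale W so its Frobenius norm equals `radius` (hypothetical retraction)."""
    scale = radius / max(frobenius_norm(W), 1e-12)
    return [[x * scale for x in row] for row in W]

# Freeze the norm at init, then re-project after every update step.
W = [[0.5, -1.0], [2.0, 0.25]]
radius = frobenius_norm(W)
W = [[x - 0.1 for x in row] for row in W]   # stand-in for an optimizer update
W = project_to_frobenius_sphere(W, radius)
assert abs(frobenius_norm(W) - radius) < 1e-9
```

Under this projection, any update component that only rescales the matrix is immediately undone, which is the intuition behind the paper's proof that weight decay is a first-order no-op on the sphere.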

Core claim

Under the Frobenius-sphere constraint with the Muon optimizer, the optimal learning rate obeys the same 0.32 data-scaling exponent previously seen for AdamW, so a base rate tuned at the smallest scale transfers across all larger compute budgets. This yields 1.58× compute efficiency at 6×10^21 FLOPs over a strong Muon baseline while all instability indicators (Z-values, output RMS, activation outliers) stay bounded and non-increasing. Weight decay becomes a first-order no-op on the sphere, Depth-μP remains necessary, and SqrtGate enables stable MoE granularity scaling with larger auxiliary load-balancing weights.
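The transfer rule implied by the claim is a one-liner: if the optimal learning rate follows η*(D) ∝ D^(−0.32) in the token budget D, a base rate tuned at the smallest scale can be rescaled to any larger budget. A sketch under that assumption (symbols and values illustrative, not the paper's notation):

```python
MAGIC_EXPONENT = 0.32  # data-scaling exponent reported for both AdamW and HyperP

def transferred_lr(base_lr: float, base_tokens: float, target_tokens: float) -> float:
    """Scale a base LR tuned at `base_tokens` to a larger token budget,
    assuming the optimal LR follows a power law with exponent -0.32."""
    return base_lr * (base_tokens / target_tokens) ** MAGIC_EXPONENT

# Tuned once at a small budget, applied in closed form at every larger one:
lr_small = 0.015
print(transferred_lr(lr_small, 10.4e9, 166.4e9))  # 16x more tokens -> ~0.41x the LR
```

The fragility flagged later in the referee report lives entirely in `MAGIC_EXPONENT`: if the true exponent drifts at large scale, this closed form mis-tunes the largest runs.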

What carries the argument

The Frobenius-sphere constraint on weight matrices together with the Muon optimizer, which enforces fixed-norm weights and removes the need for per-scale learning-rate retuning.
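Muon's contribution to this pairing is an orthogonalized update direction. A sketch of the Newton-Schulz quintic iteration commonly used in Muon implementations; the coefficients are assumed from those implementations, not quoted from this paper.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately map G to the nearest semi-orthogonal matrix, Muon-style.
    Quintic coefficients follow common Muon implementations (an assumption here)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization bounds singular values by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# One Muon-style step orthogonalizes the (momentum-averaged) gradient;
# under HyperP the weight would additionally be re-projected onto its sphere.
G = np.random.default_rng(0).standard_normal((4, 8))
O = newton_schulz_orthogonalize(G)
# Singular values of O cluster near 1, i.e. rows are roughly orthonormal.
```

The combination matters because the orthogonalized update has a controlled norm, which is what makes a single transferred learning rate plausible across widths.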

If this is right

  • One learning rate tuned at small scale works across every larger width, depth, token count, and MoE setting.
  • Instability indicators remain bounded as training FLOPs increase.
  • MoE models can use larger auxiliary load-balancing weights while staying balanced and performant.
  • SqrtGate preserves output RMS across different MoE granularities.
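The paper's exact SqrtGate form is not reproduced in this summary. One RMS-preserving reading, assuming k routed experts with near-orthogonal, unit-RMS outputs, is to normalize the gate weights so their squares sum to one:

```python
import math

def sqrt_gate(gate_weights):
    """Hypothetical RMS-preserving gate normalization: rescale so sum(g_i^2) == 1.
    With uniform gates over k experts this reduces to g_i = 1/sqrt(k)."""
    norm = math.sqrt(sum(g * g for g in gate_weights))
    return [g / norm for g in gate_weights]

# Uniform routing over k=4 experts: each gate becomes 1/sqrt(4) = 0.5, so the
# sum of 4 independent unit-RMS expert outputs keeps output RMS near 1.
print(sqrt_gate([0.25, 0.25, 0.25, 0.25]))  # [0.5, 0.5, 0.5, 0.5]
```

This is an editorial sketch of why a square-root-shaped gate would preserve output RMS across granularities, not a claim about the paper's implementation.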

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sphere constraint might remove retuning needs for optimizers other than Muon.
  • Checking transfer at scales beyond 6×10^21 FLOPs would test whether the 0.32 exponent continues to hold.
  • Applying the sphere constraint to non-language models could show whether the stability transfer generalizes.

Load-bearing premise

The Frobenius-sphere constraint plus Muon optimizer preserves the 0.32 data-scaling exponent and prevents instability without any per-scale hyperparameter retuning.

What would settle it

Train a model at 10^22 FLOPs using the single small-scale tuned learning rate and check whether any instability indicator (Z-value, output RMS, or activation outlier count) begins to rise or whether efficiency falls below the predicted 1.58× gain.
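The monitoring side of that experiment reduces to logging a few scalars per step. A sketch with placeholder definitions; the paper's exact Z-value and outlier thresholds are not given here, so the 6-sigma cutoff below is illustrative only.

```python
import math

def activation_stats(acts, outlier_sigma: float = 6.0):
    """Return (rms, outlier_count) for a flat list of activations.
    `outlier_sigma` is an illustrative threshold, not the paper's definition."""
    n = len(acts)
    rms = math.sqrt(sum(a * a for a in acts) / n)
    mean = sum(acts) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in acts) / n) or 1.0  # guard zero std
    outliers = sum(1 for a in acts if abs(a - mean) > outlier_sigma * std)
    return rms, outliers

# Log these series across training; the claim fails the moment either one
# starts rising as FLOPs increase.
rms, outliers = activation_stats([0.1, -0.2, 0.15, -0.05, 3.0])
```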

Figures

Figures reproduced from arXiv: 2603.28743 by Liliang Ren, Weizhu Chen, Yang Liu, Yelong Shen.

Figure 1. Left: Loss vs. LR at different token budgets. Right: Fitted optimal LR vs. training tokens.
Figure 2. Validation loss vs. learning rate for Muon (sweeping weight decay …).
Figure 3. Loss vs. LR curves across model sizes with Depth-…
Figure 4. Left: Loss vs. LR at different batch sizes. Right: Optimal LR vs. batch size on log-log …
Figure 5. Loss vs. LR curves for three auxiliary loss weights. The curves nearly overlap, indicating …
Figure 6. Left: Loss vs. LR across sparsity levels. Right: Optimal loss follows a power law in the …
Figure 7. Loss vs. LR across top-k values with and without SqrtGate. The exact optimal learning rates and losses are provided in …
Figure 8. Loss vs. LR across depths with HyperP (left) and without HyperP (right). HyperP keeps …
Figure 9. Left: Loss vs. FLOPs with power-law fits for all four methods. Right: Compute efficiency …
Figure 10. Stability metrics across training for MoE models at depths …
Figure 11. Relative error in optimal LR (left) and optimal loss (right) estimates vs. number of sweep …
Figure 12. Small-scale LR sweeps at d=8, 10.4B tokens. Left: Dense attention normalization variants. GA QK-Norm achieves the lowest loss with a slightly shifted optimal LR. We exclude the LR=0.02 data points for dense models because the large learning rate leads to phase changes that harm fitting goodness. Right: MoE architecture variants. SharedExp + SqrtGate achieves the best loss while all variants maintain simil…
Figure 13. Dense architecture scaling. Left: Loss vs. FLOPs with power-law fits …
Figure 14. MoE architecture scaling. Left: Loss vs. FLOPs with power-law fits. All properly-tuned …
Figure 15. Stability comparison of architecture ablations. Left: Router …
read the original abstract

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$\mu$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces HyperP, a hypersphere parameterization framework that constrains weight matrices to the Frobenius sphere and pairs it with the Muon optimizer. It proves that weight decay acts as a first-order no-op under this constraint, shows that Depth-μP remains necessary, and reports that the optimal learning rate obeys the same 0.32 data-scaling exponent previously observed for AdamW. A single base learning rate tuned at the smallest scale is claimed to transfer across width, depth, token count, and MoE granularity, delivering 1.58× compute efficiency over a strong Muon baseline at 6×10^21 FLOPs while keeping instability indicators (Z-values, output RMS, activation outliers) bounded and non-increasing. The work also proposes SqrtGate, an MoE gating mechanism derived from the hypersphere constraint, and demonstrates that larger auxiliary load-balancing weights can be used without harming expert balance. The training codebase is released.

Significance. If the transfer result and exponent invariance hold, the work offers a concrete route to reduce per-scale hyperparameter retuning for hypersphere-based optimizers, which could simplify scaling experiments and improve training stability at frontier compute budgets. The explicit proof that weight decay is a first-order no-op, the requirement for Depth-μP, the SqrtGate construction, and the public code release are all positive contributions that strengthen the manuscript's utility to the community.

major comments (2)
  1. [Empirical scaling results and learning-rate transfer section] The central transfer claim at 6×10^21 FLOPs rests on the data-scaling exponent for the optimal learning rate remaining exactly 0.32 under the Frobenius-sphere + Muon dynamics. The manuscript reports that this exponent matches prior AdamW observations from small-scale fits, but provides no derivation or invariance argument showing why the hypersphere constraint preserves the exponent when width, depth, tokens, and MoE granularity are scaled simultaneously. If the effective exponent shifts even modestly, the fixed base LR would mis-tune at the largest budget, undermining both the 1.58× efficiency figure and the “no per-scale retuning” claim.
  2. [Large-scale experiments and efficiency comparison] The 1.58× compute-efficiency comparison at 6×10^21 FLOPs is presented against a “strong Muon baseline.” The manuscript must specify exactly how the baseline was tuned (whether it received per-scale LR retuning or used the same transfer protocol) and report the precise FLOPs-matched token counts and model configurations used for both arms; without these details the efficiency gain cannot be verified as arising from HyperP rather than from differences in baseline tuning effort.
minor comments (3)
  1. [Introduction and abstract] The phrase “magic exponent” 0.32 should be accompanied by a direct citation to the original AdamW scaling-law paper on first mention.
  2. [MoE and SqrtGate section] The SqrtGate derivation would benefit from an explicit equation showing how the hypersphere constraint leads to the square-root gating form; a short derivation paragraph would improve clarity.
  3. [Stability analysis figures] Plots of instability indicators (Z-values, output RMS) should include shaded regions or error bars across multiple random seeds to substantiate the claim that they remain bounded and non-increasing.
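For concreteness, one plausible shape of the derivation requested in minor comment 2, as an editorial sketch under the assumption of k near-orthogonal, unit-RMS expert outputs (not the paper's derivation):

```latex
% Top-$k$ routing output: $y = \sum_{i=1}^{k} g_i\, e_i$, with
% $\mathrm{RMS}(e_i) \approx 1$ and $\langle e_i, e_j\rangle \approx 0$ for $i \neq j$.
\mathrm{RMS}(y)^2 \;\approx\; \sum_{i=1}^{k} g_i^2\,\mathrm{RMS}(e_i)^2
\;=\; \sum_{i=1}^{k} g_i^2,
\qquad \text{so } \tilde g_i = \frac{g_i}{\sqrt{\sum_j g_j^2}}
\;\Longrightarrow\; \mathrm{RMS}(y) \approx 1.
% Uniform gates $g_i = 1/k$ then give $\tilde g_i = 1/\sqrt{k}$, hence "SqrtGate".
```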

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, providing clarifications on the empirical nature of our scaling results and committing to expanded experimental details in the revision.

read point-by-point responses
  1. Referee: [Empirical scaling results and learning-rate transfer section] The central transfer claim at 6×10^21 FLOPs rests on the data-scaling exponent for the optimal learning rate remaining exactly 0.32 under the Frobenius-sphere + Muon dynamics. The manuscript reports that this exponent matches prior AdamW observations from small-scale fits, but provides no derivation or invariance argument showing why the hypersphere constraint preserves the exponent when width, depth, tokens, and MoE granularity are scaled simultaneously. If the effective exponent shifts even modestly, the fixed base LR would mis-tune at the largest budget, undermining both the 1.58× efficiency figure and the “no per-scale retuning” claim.

    Authors: We acknowledge that the 0.32 exponent is reported as an empirical observation from fits across our HyperP experiments, matching the value previously seen for AdamW, rather than derived from a theoretical invariance argument under the Frobenius-sphere constraint. Our multi-scale ablations (width, depth, tokens, and MoE granularity) show the exponent remains consistent in practice, supporting the single-base-LR transfer. We will revise the manuscript to add a dedicated paragraph in the scaling section explicitly noting the empirical basis, include additional log-log plots of optimal LR versus tokens at intermediate scales, and discuss potential limitations if the exponent were to drift at even larger budgets. This strengthens transparency without overstating theoretical guarantees. revision: partial

  2. Referee: [Large-scale experiments and efficiency comparison] The 1.58× compute-efficiency comparison at 6×10^21 FLOPs is presented against a “strong Muon baseline.” The manuscript must specify exactly how the baseline was tuned (whether it received per-scale LR retuning or used the same transfer protocol) and report the precise FLOPs-matched token counts and model configurations used for both arms; without these details the efficiency gain cannot be verified as arising from HyperP rather than from differences in baseline tuning effort.

    Authors: We agree that these details are essential for verification. The Muon baseline employed the identical transfer protocol: a single base learning rate tuned at the smallest scale and applied without per-scale retuning. We will revise the efficiency-comparison subsection to state this explicitly and add a table listing the precise configurations for both arms at 6×10^21 FLOPs (model width, depth, token count, MoE granularity, and exact FLOPs-matched training budgets). This will confirm the 1.58× gain arises from HyperP dynamics rather than differential tuning effort. revision: yes

Circularity Check

0 steps flagged

No circularity; the scaling exponent and transfer results are empirical observations, not reductions to fitted inputs or self-citations.

full rationale

The paper's derivation chain consists of a standalone proof that weight decay is a first-order no-op on the Frobenius sphere, an empirical finding that the optimal learning rate follows the previously observed 0.32 exponent across scales, and direct large-scale validation of LR transfer under HyperP. None of these steps reduce by construction to their inputs: the exponent match is reported from experiments at multiple budgets rather than being fitted at the smallest scale and renamed as a prediction; the transfer claim is tested at 6e21 FLOPs with monitored stability indicators; Muon and the 0.32 exponent are treated as external priors. No self-citations are load-bearing, no ansatz is smuggled, and no uniqueness theorem is invoked from prior author work. The results are self-contained against external benchmarks and falsifiable via the released codebase.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

The work introduces HyperP and SqrtGate as new constructs, relies on the external Muon optimizer, and treats the 0.32 data-scaling exponent as an observed constant rather than a derived one. No additional free parameters beyond the base learning rate are introduced in the abstract.

free parameters (1)
  • base learning rate
    Tuned once at the smallest scale and transferred; its specific value is not stated in the abstract.
axioms (2)
  • domain assumption Weight decay is a first-order no-op on the Frobenius sphere
    Stated as proved in the paper under the hypersphere constraint.
  • domain assumption Depth-μP remains necessary even under the sphere constraint
    Explicitly retained as a requirement for the transfer to work.
invented entities (2)
  • HyperP no independent evidence
    purpose: Hypersphere parameterization enabling LR transfer across scales
    New framework introduced to combine Frobenius-sphere constraint with Muon.
  • SqrtGate no independent evidence
    purpose: MoE gating mechanism that preserves output RMS across granularities
    Derived directly from the hypersphere constraint.

pith-pipeline@v0.9.0 · 5602 in / 1523 out tokens · 58779 ms · 2026-05-14T21:47:28.338003+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Rate Transfer in Normalized Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.

  2. Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

    cs.LG 2026-05 unverdicted novelty 4.0

    Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 2 Pith papers · 13 internal anchors

  1. [1]

    Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv: 2505.13738,

    [BDG+25] Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in llm pre-training.arXiv preprint arXiv: 2505.13738,

  2. [2]

    Layer Normalization

[BKH16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

  3. [3]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

[BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    [DA24a] DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv: 2405.04434,

  5. [5]

    DeepSeek-V3 Technical Report

    [DA24b] DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv: 2412.19437,

  6. [6]

    Why gradients rapidly increase near the end of training.arXiv preprint arXiv: 2506.02285,

    [Def25] Aaron Defazio. Why gradients rapidly increase near the end of training.arXiv preprint arXiv: 2506.02285,

  7. [7]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    [DLBZ22] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale.arXiv preprint arXiv: 2208.07339,

  8. [8]

    Nemotron-flash: Towards latency-optimal hybrid small language models.arXiv preprint arXiv: 2511.18890,

    [FDD+25] Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Nemotron-flash: Towards latency-optimal hybrid small language models.arXiv preprint arXiv: 2511.18890,

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    [Goo25] Google. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv: 2507.06261,

  10. [10]

    Analyzing and improving the training dynamics of diffusion models.arXiv preprint arXiv: 2312.02696,

    [KAL+23] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models.arXiv preprint arXiv: 2312.02696,

  11. [11]

    Scaling Laws for Neural Language Models

    [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv: 2001.08361,

  12. [12]

Mamba-3: Improved Sequence Modeling Using State Space Principles

    [LLC+26] Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. arXiv preprint arXiv:2603.15569,

  13. [13]

    Predictable scale: Part i - optimal hyperparameter scaling law in large language model pretraining.arXiv preprint arXiv: 2503.04715,

    [LZH+25] Houyi Li, Wenzhen Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Shuigeng Zhou, Xiangyu Zhang, and Daxin Jiang. Predictable scale: Part i - optimal hyperparameter scaling law in large language model pretraining.arXiv preprint arXiv: 2503.04715,

  14. [14]

    An Empirical Model of Large-Batch Training

    [MKAT18] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training.arXiv preprint arXiv: 1812.06162,

  15. [15]

    gpt-oss-120b & gpt-oss-20b Model Card

    [Ope25] OpenAI. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv: 2508.10925,

  16. [16]

    A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training.arXiv preprint arXiv: 2601.22966,

    [QHW+26] Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, Yang Xu, Haoran Lian, Siqi Zhang, Rui Men, Jianwei Zhang, Ivan Titov, Dayiheng Liu, Jingren Zhou, and Junyang Lin. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training.arXi...

  17. [17]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    [QWZ+25] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In The Thirty-ninth Annual Conference on Neural Information Proces…

  18. [18]

    GLU Variants Improve Transformer

[Sha20] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202,

  19. [19]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    [SMM+17] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,

  20. [20]

    Kimi K2: Open Agentic Intelligence

[Tea25a] Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

  21. [21]

    Every activation boosted: Scaling general reasoner to 1 trillion open language foundation.arXiv preprint arXiv: 2510.22115,

    [Tea25b] Ling Team. Every activation boosted: Scaling general reasoner to 1 trillion open language foundation.arXiv preprint arXiv: 2510.22115,

  22. [22]

    How to set adamw’s weight decay as you scale model and dataset size.arXiv preprint arXiv: 2405.13698,

[WA24] Xi Wang and Laurence Aitchison. How to set AdamW's weight decay as you scale model and dataset size. arXiv preprint arXiv:2405.13698,

  23. [23]

    Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv: 2408.15664,

    [WGZ+24] Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv: 2408.15664,

  24. [24]

    Controlled llm training on spectral sphere.arXiv preprint arXiv: 2601.08393,

    [XLT+26] Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, and Baining Guo. Controlled llm training on spectral sphere.arXiv preprint arXiv: 2601.08393,

  25. [25]

    On layer normalization in the transformer architecture

    [XYH+20] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 ofProceedings of Machine Learning Res...

  26. [26]

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

    [YHB+22] Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466,

  27. [27]

    Qwen3 Technical Report

    [YLY+25] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kex...

  28. [28]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    [ZBK+22] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models.arXiv preprint arXiv: 2202.08906,

  29. [29]

    OPT: Open Pre-trained Transformer Language Models

[ZRG+22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv…

  30. [30]

    Table 4: Optimal LR vs. training token budget under fine-grid sweeping with quadratic fitting.

        Training Tokens    Fitted η*    Fitted Min Loss
        10.4B              0.01515      2.4741
        20.8B              0.01208      2.4189
        41.6B              0.00958      2.3773
        83.2B              0.00772      2.3456
        166.4B             0.00635      2.3214

    Table 5: Validation loss vs. LR across model depth at a fixed token budget of 10.4B without Depth-µP. …
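The Table 4 sweep is enough to check the exponent by hand: a least-squares fit of log η* against log token budget on those five points (an editorial check, not from the paper) lands close to the reported value.

```python
import math

# Optimal LR vs. token budget, as extracted in Table 4 above.
tokens = [10.4, 20.8, 41.6, 83.2, 166.4]          # billions of training tokens
opt_lr = [0.01515, 0.01208, 0.00958, 0.00772, 0.00635]

# Ordinary least squares in log-log space: slope is the scaling exponent.
xs = [math.log(t) for t in tokens]
ys = [math.log(lr) for lr in opt_lr]
xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)

print(round(slope, 3))  # ~ -0.315, consistent with the reported 0.32 exponent
```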