MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

Hongtao Xu; Jiacheng Li; Jianchao Tan; Jiaqi Zhang; Xunliang Cai; Yerui Sun; Yifan Lu; Yuchen Xie

arxiv: 2605.26842 · v1 · pith:7N54CBHLnew · submitted 2026-05-26 · 💻 cs.LG · cs.CL

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

Jiacheng Li , Jianchao Tan , Hongtao Xu , Jiaqi Zhang , Yifan Lu , Yerui Sun , Yuchen Xie , Xunliang Cai This is my paper

Pith reviewed 2026-06-29 19:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords MONA optimizerMuon optimizerNesterov accelerationMixture-of-Experts pretrainingconvergence analysisspectral-norm regularizationAdamWlanguage model training

0 comments

The pith

MONA adds an acceleration term from the exponential moving average of gradient differences into Muon's gradient processing pipeline to escape sharp minima while keeping spectral-norm regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MONA as an optimizer that combines Muon's matrix orthogonalization with a Nesterov-style acceleration term. This term is inserted directly into the gradient processing step and is computed from the exponential moving average of gradient differences. The goal is to help the optimizer leave sharp local minima without losing the geometry-aware updates that Muon provides. The authors supply a convergence analysis and report stronger pretraining convergence plus better downstream results than both Muon and AdamW on Mixture-of-Experts models ranging from 1B to 68B parameters, with the largest run using one trillion tokens. After supervised fine-tuning on the 68B model, MONA reaches state-of-the-art scores on general capability, mathematical reasoning, and code generation tasks.

Core claim

MONA adds an acceleration term, calculated from the exponential moving average of gradient differences, directly into Muon's gradient processing pipeline. Convergence analysis shows that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, supervised fine-tuning on the MOE-68B-A3B model yields SOTA performance on general capability, mathematical reasoning, and code generatio

What carries the argument

The acceleration term added directly into Muon's gradient processing pipeline and computed from the exponential moving average of gradient differences.

If this is right

MONA reaches lower training loss than Muon or AdamW on MoE models from 1B to 68B parameters.
It produces higher downstream task performance after pretraining on up to one trillion tokens.
The spectral-norm regularization property of Muon remains intact.
SOTA results appear on general, math, and code benchmarks after fine-tuning the 68B model.
The same gains hold across three different model scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The acceleration mechanism might be added to other orthogonalization-based optimizers with similar effect.
Training runs on dense transformer models could test whether the gains are specific to MoE architectures.
The convergence analysis could be extended to derive explicit rates that quantify the escape speed from sharp minima.
Longer training horizons beyond one trillion tokens might reveal whether the advantage persists or saturates.

Load-bearing premise

The acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization.

What would settle it

A side-by-side pretraining run of the 68B MoE model in which MONA produces neither faster loss reduction nor higher final downstream scores than Muon would falsify the empirical superiority claim.

Figures

Figures reproduced from arXiv: 2605.26842 by Hongtao Xu, Jiacheng Li, Jianchao Tan, Jiaqi Zhang, Xunliang Cai, Yerui Sun, Yifan Lu, Yuchen Xie.

**Figure 1.** Figure 1: General capability evaluation results for pretraining MOE-68B-A3B at 700B tokens. MONA consistently outperforms Muon and AdamW across multiple benchmarks. Liu et al., 2024b; Team et al., 2025b; Yang et al., 2025), the requirement for optimizers with superior sample efficiency has increased. Recently, Muon (Jordan et al., 2024) has become a solid alternative. Instead of updating each parameter on its own… view at source ↗

**Figure 2.** Figure 2: Validation loss on Code-Valid for MOE1B-A0d2B [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Validation loss on General-English-Text for MOE-1B-A0d2B [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Validation loss on Code-Valid for MOE6B-A0d5B. MOE-1B-A0d2B. The smallest model uses 10 transformer layers with 768 hidden dimensions, 16 attention heads, 128 experts with 256 FFN hidden size each, and top-8 routing [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Validation loss on General-English-Text for MOE-6B-A0d5B. We train for approximately 400B tokens. As shown in [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: SFT training loss on code data for the MOE-68B-A3B model. MONA-pretrained checkpoint (red) achieves lower loss than Muon (green) throughout training, with a larger gap emerging in later epochs. To assess the practical utility of MONAoptimized models beyond pretraining, we conduct supervised fine-tuning (SFT) on the MOE-68B-A3B model using high-quality code data. The SFT stage employs Adam with a peak le… view at source ↗

**Figure 7.** Figure 7: Validation loss on Code_Valid for the MOE-1B-A0d2B model, extending [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Validation loss on General-English-Text [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 10.** Figure 10: Validation loss on Chinese-AcademicText for MOE-1B-A0d2B [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Validation loss on MathematicalReasoning for MOE-6B-A0d5B [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Validation loss on Chinese-AcademicText for MOE-6B-A0d5B [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗

**Figure 9.** Figure 9: Validation loss on MathematicalReasoning for MOE-1B-A0d2B. C Computational Overhead Analysis The memory overhead of MONA’s additional buffers was already discussed in Section 6.3. Here, we add measurements of computational time overhead from the MOE-6B-A0d5B pretraining run [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 13.** Figure 13: Optimizer inner step time for MOE-6BA0d5B pretraining. MONA runs about 1% slower than Muon at the optimizer step level [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗

**Figure 14.** Figure 14: End-to-end iteration time for MOE-6BA0d5B pretraining. MONA and Muon show no practical difference in overall training speed. D Comparison with Accelerated AdamW To better understand where the acceleration gains come from, we compare MONA against not only Muon and AdamW but also an AdamW variant equipped with the same acceleration term. We call this variant AdamWAcc. It is adapted from ALTO’s accelerati… view at source ↗

**Figure 15.** Figure 15: Training loss for MOE-1B-A0d2B with four optimizers: AdamW(black), AdamWAcc(purple), Muon(blue), and MONA(green) [PITH_FULL_IMAGE:figures/full_fig_p014_15.png] view at source ↗

**Figure 16.** Figure 16: Validation loss on General-EnglishText for MOE-1B-A0d2B. MONA outperforms Muon, which in turn outperforms AdamW-Acc and AdamW [PITH_FULL_IMAGE:figures/full_fig_p015_16.png] view at source ↗

read the original abstract

The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon's gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MONA is a direct Nesterov addition to Muon with scaling experiments on MoE models up to 68B, and the analysis looks internally consistent.

read the letter

The paper's main move is to insert an acceleration term—computed as an exponential moving average of gradient differences—straight into Muon's gradient processing step. This is framed as giving Nesterov-like escape from sharp minima while keeping the spectral-norm regularization that Muon already provides. They supply a convergence analysis for the combined method and run pretraining on Mixture-of-Experts models at 1B, and up to 68B parameters trained on a trillion tokens, plus supervised fine-tuning on the largest model with results on general, math, and code benchmarks.

What the work actually delivers is a set of head-to-head comparisons against both Muon and AdamW at multiple scales, plus the downstream SFT evaluation. The stress-test found no load-bearing inconsistencies in the argument structure or hidden circularity in the derivations, which is the main thing that matters here.

The soft spot is that the core idea is an explicit combination of two established pieces rather than a new mechanism, so the novelty sits in the integration and the empirical scaling rather than in fresh theory. The gains are presented as direct, but any referee would still want to see the exact hyperparameter matching and whether the acceleration term introduces any extra tuning burden that could affect the comparison. The SOTA claim on the fine-tuning benchmarks is stated clearly but would need the full tables to judge the size of the effect.

This is for people who train or tune large MoE models and care about optimizer choices at that scale. A reader already following Muon or second-order-style methods would find the experiments relevant. The combination of analysis and multi-scale runs is enough to justify sending it to peer review rather than desk rejection.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces MONA, which augments the Muon optimizer by inserting an acceleration term—computed as the exponential moving average of gradient differences—directly into Muon's gradient processing pipeline. It supplies a convergence analysis asserting that this term enables escape from sharp minima while preserving Muon's spectral-norm regularization property. Empirically, MONA is shown to yield better convergence and downstream performance than both Muon and AdamW on Mixture-of-Experts pretraining at three scales (1B to 68B parameters), with the largest model trained on 1 trillion tokens; after supervised fine-tuning of the 68B model, MONA attains SOTA results on general capability, mathematical reasoning, and code generation benchmarks.

Significance. If the convergence analysis holds under the step-size regimes and model architectures used in the experiments, and if the scaling results prove reproducible, MONA would represent a meaningful advance in optimizer design for large-scale language model training by combining geometry-aware orthogonalization with curvature-aware acceleration. The multi-scale empirical evaluation (1B–68B) together with the SFT benchmark results constitutes a substantial practical contribution; the explicit preservation of spectral-norm regularization is a noteworthy theoretical strength.

minor comments (3)

[Convergence analysis] The convergence analysis section would benefit from an explicit statement of the Lipschitz or smoothness assumptions required for the escape-from-sharp-minima guarantee, together with a brief discussion of how these assumptions align with the MoE routing dynamics observed in the 68B experiments.
[Experiments] Table 2 (or equivalent results table): the reported downstream metrics after SFT should include the number of independent runs or standard deviations to allow assessment of statistical reliability of the SOTA claim.
[Method] The description of the acceleration term insertion (around Eq. (X) in the method section) could be accompanied by a short pseudocode snippet showing the exact placement relative to Muon's orthogonalization step for immediate reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work on MONA and the recommendation for minor revision. The referee's description of the method, convergence analysis, and empirical results on MoE models up to 68B parameters is accurate.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's central theoretical contribution is a convergence analysis showing that the added acceleration term (EMA of gradient differences) enables escape from sharp minima while preserving Muon's spectral-norm regularization. No load-bearing step reduces by construction to a fitted input, self-definition, or self-citation chain; the analysis is presented as a direct derivation from the modified update rule using standard optimization techniques. Empirical results consist of direct head-to-head comparisons on independent pretraining and SFT benchmarks across scales, without renaming known patterns or smuggling ansatzes via citation. The derivation chain therefore stands on its own stated assumptions and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified or can be extracted.

pith-pipeline@v0.9.1-grok · 5750 in / 1226 out tokens · 52087 ms · 2026-06-29T19:12:54.374905+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 32 canonical work pages · 18 internal anchors

[1]

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. 2025. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295

work page arXiv 2025
[2]

Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and 1 others. 2025. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387

work page arXiv 2025
[3]

Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro Von Werra. 2022. A framework for the evaluation of code generation models

2022
[4]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence

2020
[5]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

2020
[6]

Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. 2024. Shortcut-connected expert parallelism for accelerating mixture-of-experts. arXiv preprint arXiv:2404.05019

work page arXiv 2024
[7]

Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, and 1 others. Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022. URL https://arxiv. org/abs/2208.08227

work page arXiv 2022
[8]

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and 1 others. 2023. Symbolic discovery of optimization algorithms. Advances in neural information processing systems, 36:49205--49233

2023
[9]

Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, Zhengyu Chen, Shijie Geng, Aoyan Li, Bo Li, and 1 others. 2024. Fullstack bench: Evaluating llms as full stack coders. arXiv preprint arXiv:2412.00535

work page arXiv 2024
[10]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

DeepSeek-AI. 2026. Deepseek-v4: Towards highly efficient million-token context intelligence

2026
[12]

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short ...

2019
[13]

Kaja Gruntkowska, Yassine Maziane, Zheng Qu, and Peter Richt \'a rik. 2025. Drop-muon: Update less, converge faster. arXiv preprint arXiv:2510.02239

work page arXiv 2025
[14]

Alex Gu, Baptiste Rozi \`e re, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Vineet Gupta, Tomer Koren, and Yoram Singer. 2018. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842--1850. PMLR

2018
[16]

Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. 2025. Root: Robust orthogonalized optimizer for neural network training. arXiv preprint arXiv:2511.20626

work page arXiv 2025
[17]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2020
[18]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, and 1 others. 2023. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in neural information processing systems, 36:62991--63010

2023
[20]

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation, 3(1):79--87

1991
[21]

Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. Livecodebench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, volume 2025, pages 58791--58831

2025
[22]

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. 2024. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan. github. io/posts/muon, 6(3):4

2024
[23]

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836

work page internal anchor Pith review Pith/arXiv arXiv 2016
[24]

Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, and Youngsuk Park. 2025. https://doi.org/10.48550/arXiv.2510.16981 Muonbp: Faster muon via block-periodic orthogonalization . arXiv preprint arXiv:2510.16981

work page doi:10.48550/arxiv.2510.16981 2025
[25]

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[26]

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319--18345. PMLR

2023
[27]

CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese, 2024. URL https://arxiv. org/abs/2306.09212

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, and 1 others. 2024 a . Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others. 2024 b . Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, and Xunliang Cai. 2026. https://arxiv.org/abs/2601.21204 Scaling embeddings outperforms scaling experts in language models . Preprint, arXiv:2601.21204

work page arXiv 2026
[32]

Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024 c . Evaluating language models for efficient code generation. arXiv preprint arXiv:2408.06450

work page arXiv 2024
[33]

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, and 1 others. 2025. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

James Martens and Roger Grosse. 2015. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408--2417. PMLR

2015
[36]

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and 1 others. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP

2018
[38]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics, pages 400--407

1951
[40]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641

work page internal anchor Pith review Pith/arXiv arXiv 2019
[41]

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. https://www.aclweb.org/anthology/D19-1454 Social iqa: Commonsense reasoning about social interactions . In EMNLP

2019
[42]

Chongjie Si, Debing Zhang, and Wei Shen. 2025. Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005

work page arXiv 2025
[43]

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139--1147. pmlr

2013
[44]

Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and 1 others. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003--13051

2023
[45]

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621--3634

2021
[46]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. https://doi.org/10.18653/v1/N19-1421 C ommonsense QA : A question answering challenge targeting commonsense knowledge . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ...

work page doi:10.18653/v1/n19-1421 2019
[47]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, and 150 others. 2025 a . https://arxiv.org/abs/2507.20534 Kimi k2: Open agentic intelligence . Preprint, arXiv:2507.20534

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, and 1 others. 2025 b . Longcat-flash technical report. arXiv preprint arXiv:2509.01322

work page arXiv 2025
[49]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30

2017
[50]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, and 1 others. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266--95290

2024
[51]

Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. 2025. Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms. arXiv preprint arXiv:2504.14655

work page arXiv 2025
[52]

Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. 2024. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508--9520

2024
[53]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Yang You, Igor Gitman, and Boris Ginsburg. 2017. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888

work page internal anchor Pith review Pith/arXiv arXiv 2017
[55]

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962

work page internal anchor Pith review Pith/arXiv arXiv 2019
[56]

Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022. When language model meets private library. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 277--288

2022
[57]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

2019
[58]

Tong Zhao, Jiacheng Li, Yuanchang Zhou, Guangming Tan, and Weile Jia. 2026. Exploring landscapes for better minima along valleys. Advances in Neural Information Processing Systems, 38:171496--171547

2026
[59]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, and 1 others. 2025. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. In International Conference on Learning Representations, volume 2025, pages 66602--66656

2025
[60]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[61]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[1] [1]

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. 2025. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295

work page arXiv 2025

[2] [2]

Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and 1 others. 2025. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387

work page arXiv 2025

[3] [3]

Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro Von Werra. 2022. A framework for the evaluation of code generation models

2022

[4] [4]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence

2020

[5] [5]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

2020

[6] [6]

Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. 2024. Shortcut-connected expert parallelism for accelerating mixture-of-experts. arXiv preprint arXiv:2404.05019

work page arXiv 2024

[7] [7]

Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, and 1 others. Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022. URL https://arxiv. org/abs/2208.08227

work page arXiv 2022

[8] [8]

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and 1 others. 2023. Symbolic discovery of optimization algorithms. Advances in neural information processing systems, 36:49205--49233

2023

[9] [9]

Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, Zhengyu Chen, Shijie Geng, Aoyan Li, Bo Li, and 1 others. 2024. Fullstack bench: Evaluating llms as full stack coders. arXiv preprint arXiv:2412.00535

work page arXiv 2024

[10] [10]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

DeepSeek-AI. 2026. Deepseek-v4: Towards highly efficient million-token context intelligence

2026

[12] [12]

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short ...

2019

[13] [13]

Kaja Gruntkowska, Yassine Maziane, Zheng Qu, and Peter Richt \'a rik. 2025. Drop-muon: Update less, converge faster. arXiv preprint arXiv:2510.02239

work page arXiv 2025

[14] [14]

Alex Gu, Baptiste Rozi \`e re, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. 2024. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Vineet Gupta, Tomer Koren, and Yoram Singer. 2018. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842--1850. PMLR

2018

[16] [16]

Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. 2025. Root: Robust orthogonalized optimizer for neural network training. arXiv preprint arXiv:2511.20626

work page arXiv 2025

[17] [17]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2020

[18] [18]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021

[19] [19]

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, and 1 others. 2023. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in neural information processing systems, 36:62991--63010

2023

[20] [20]

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation, 3(1):79--87

1991

[21] [21]

Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. Livecodebench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, volume 2025, pages 58791--58831

2025

[22] [22]

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. 2024. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan. github. io/posts/muon, 6(3):4

2024

[23] [23]

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836

work page internal anchor Pith review Pith/arXiv arXiv 2016

[24] [24]

Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, and Youngsuk Park. 2025. https://doi.org/10.48550/arXiv.2510.16981 Muonbp: Faster muon via block-periodic orthogonalization . arXiv preprint arXiv:2510.16981

work page doi:10.48550/arxiv.2510.16981 2025

[25] [25]

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014

[26] [26]

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319--18345. PMLR

2023

[27] [27]

CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese, 2024. URL https://arxiv. org/abs/2306.09212

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [29]

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, and 1 others. 2024 a . Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [30]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others. 2024 b . Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [31]

Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, and Xunliang Cai. 2026. https://arxiv.org/abs/2601.21204 Scaling embeddings outperforms scaling experts in language models . Preprint, arXiv:2601.21204

work page arXiv 2026

[31] [32]

Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024 c . Evaluating language models for efficient code generation. arXiv preprint arXiv:2408.06450

work page arXiv 2024

[32] [33]

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, and 1 others. 2025. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [34]

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2017

[34] [35]

James Martens and Roger Grosse. 2015. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408--2417. PMLR

2015

[35] [36]

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and 1 others. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740

work page internal anchor Pith review Pith/arXiv arXiv 2017

[36] [37]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP

2018

[37] [38]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [39]

Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics, pages 400--407

1951

[39] [40]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641

work page internal anchor Pith review Pith/arXiv arXiv 2019

[40] [41]

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. https://www.aclweb.org/anthology/D19-1454 Social iqa: Commonsense reasoning about social interactions . In EMNLP

2019

[41] [42]

Chongjie Si, Debing Zhang, and Wei Shen. 2025. Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005

work page arXiv 2025

[42] [43]

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139--1147. pmlr

2013

[43] [44]

Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and 1 others. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003--13051

2023

[44] [45]

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621--3634

2021

[45] [46]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. https://doi.org/10.18653/v1/N19-1421 C ommonsense QA : A question answering challenge targeting commonsense knowledge . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ...

work page doi:10.18653/v1/n19-1421 2019

[46] [47]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, and 150 others. 2025 a . https://arxiv.org/abs/2507.20534 Kimi k2: Open agentic intelligence . Preprint, arXiv:2507.20534

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [48]

Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, and 1 others. 2025 b . Longcat-flash technical report. arXiv preprint arXiv:2509.01322

work page arXiv 2025

[48] [49]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30

2017

[49] [50]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, and 1 others. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266--95290

2024

[50] [51]

Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. 2025. Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms. arXiv preprint arXiv:2504.14655

work page arXiv 2025

[51] [52]

Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. 2024. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508--9520

2024

[52] [53]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [54]

Yang You, Igor Gitman, and Boris Ginsburg. 2017. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888

work page internal anchor Pith review Pith/arXiv arXiv 2017

[54] [55]

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962

work page internal anchor Pith review Pith/arXiv arXiv 2019

[55] [56]

Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022. When language model meets private library. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 277--288

2022

[56] [57]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

2019

[57] [58]

Tong Zhao, Jiacheng Li, Yuanchang Zhou, Guangming Tan, and Weile Jia. 2026. Exploring landscapes for better minima along valleys. Advances in Neural Information Processing Systems, 38:171496--171547

2026

[58] [59]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, and 1 others. 2025. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. In International Conference on Learning Representations, volume 2025, pages 66602--66656

2025

[59] [60]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[60] [61]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...