pith. machine review for the scientific record.

arxiv: 2502.16982 · v1 · submitted 2025-02-24 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Muon is Scalable for LLM Training

Bohong Yin, Enzhe Lu, Guokun Lai, Han Zhu, Hao Zhang, Huabin Zheng, Jianlin Su, Jianzhou Wang, Jingyuan Liu, Junjie Yan, Mengnan Dong, Shaowei Liu, Weiran He, Weixin Xu, Xingcheng Yao, Xinran Xu, Xinyu Zhou, Yanru Chen, Yibo Liu, Yidao Qin, Yongsheng Kang, Yulun Du, Yutao Zhang, Yuxin Wu, Yuzhi Wang, Zhejun Jiang, Zheng Zhang, Zhilin Yang

Pith reviewed 2026-05-11 22:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords Muon optimizer · LLM training · scaling laws · Mixture-of-Experts · computational efficiency · weight decay · optimizer scaling

The pith

Muon optimizer scales to large LLMs and delivers roughly twice the computational efficiency of AdamW when weight decay is added and per-parameter update scales are adjusted.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Muon, an optimizer based on matrix orthogonalization, can be made to work reliably for large language model training by adding weight decay and adjusting per-parameter update scales. These changes eliminate the need for extra hyperparameter tuning at scale. Scaling-law experiments then indicate that Muon reaches roughly twice the computational efficiency of AdamW under compute-optimal conditions. The authors demonstrate the result by training Moonlight, a 3B/16B-parameter MoE model, on 5.7 trillion tokens, improving the performance-versus-FLOPs frontier over previous models.
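To make the mechanics concrete, here is a minimal sketch of a Muon-style update with the two additions the summary describes: decoupled weight decay and a shape-derived per-matrix update scale. The Newton-Schulz coefficients and the 0.2·sqrt(max(d_in, d_out)) factor are assumptions taken from public Muon write-ups rather than details confirmed by the text above, and the function names are illustrative.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to G (the U V^T of its SVD)
    via a quintic Newton-Schulz iteration; coefficients follow public Muon write-ups."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, momentum, grad, lr=2e-2, beta=0.95, weight_decay=0.1):
    """One Muon-style update with the two additions highlighted above:
    decoupled (AdamW-style) weight decay and a per-matrix update scale."""
    momentum.mul_(beta).add_(grad)            # momentum buffer
    update = newton_schulz_orthogonalize(momentum)
    scale = 0.2 * max(W.shape) ** 0.5         # illustrative RMS-matching factor (assumed)
    W.mul_(1.0 - lr * weight_decay)           # decoupled weight decay
    W.add_(update, alpha=-lr * scale)
    return W, momentum
```

The point of the weight-decay line and the fixed, shape-derived scale is that nothing in the update depends on a tuned, model-size-specific constant, which is what the no-retuning claim requires.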

Core claim

Muon achieves approximately twice the computational efficiency of AdamW in compute-optimal LLM training once weight decay is incorporated and per-parameter update scales are adjusted, allowing it to train large models out of the box without further tuning. This is demonstrated by training the 16B-parameter Moonlight MoE model on 5.7T tokens, which surpasses the prior Pareto frontier.

What carries the argument

The Muon optimizer based on matrix orthogonalization, augmented with weight decay and adjusted per-parameter update scales.

If this is right

  • Muon can be applied directly to large-scale LLM training runs without extensive hyperparameter searches.
  • Mixture-of-Experts models trained with Muon can reach better performance at lower total training FLOPs.
  • Distributed Muon implementations that minimize memory and communication overhead become available for immediate use (a schematic of one possible sharding pattern follows this list).
  • Intermediate and final checkpoints from the 5.7T-token Moonlight training run are released for downstream research.
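As a rough illustration of the distributed-implementation point above, the sketch below shows one plausible ZeRO-1-style sharding pattern: each rank stores only a shard of the momentum buffer for memory savings and gathers the full matrix just for the orthogonalization step. This is an editorial guess at the general pattern, not the authors' released code; newton_schulz_orthogonalize refers to the helper sketched earlier, and all names are illustrative.

```python
import torch
import torch.distributed as dist

def distributed_muon_step(weight_shard, momentum_shard, grad_shard,
                          lr=2e-2, beta=0.95, weight_decay=0.1):
    """Hypothetical sharded Muon step (assumes an initialized process group and
    row-sharded parameters): local momentum update, all-gather for Newton-Schulz,
    then each rank applies only its own rows of the orthogonalized update."""
    world, rank = dist.get_world_size(), dist.get_rank()

    momentum_shard.mul_(beta).add_(grad_shard)            # no communication needed here

    # Gather row-shards into the full momentum matrix for orthogonalization.
    gathered = [torch.empty_like(momentum_shard) for _ in range(world)]
    dist.all_gather(gathered, momentum_shard)
    full_momentum = torch.cat(gathered, dim=0)

    # Orthogonalize once, then keep only this rank's rows of the resulting update.
    update_shard = newton_schulz_orthogonalize(full_momentum).chunk(world, dim=0)[rank]

    scale = 0.2 * max(full_momentum.shape) ** 0.5         # same illustrative scale as before
    weight_shard.mul_(1.0 - lr * weight_decay)
    weight_shard.add_(update_shard, alpha=-lr * scale)
    return weight_shard, momentum_shard
```

Whether the real implementation gathers along the data-parallel group, overlaps communication, or partitions differently is exactly the kind of detail the released code would settle.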

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the efficiency gain holds at frontier scales, training runs could complete in roughly half the wall-clock time or energy for equivalent performance.
  • The same weight-decay and scale-adjustment pattern may transfer to other orthogonalization-based optimizers beyond Muon.
  • Broad adoption would shift default optimizer choices in LLM training pipelines toward orthogonalization methods.

Load-bearing premise

Adding weight decay and carefully adjusting the per-parameter update scale allows Muon to work out-of-the-box on large-scale training without hyper-parameter tuning.

What would settle it

A direct comparison of compute-optimal scaling curves for Muon versus AdamW on models exceeding 16B parameters that shows the efficiency advantage disappearing or reversing.

read the original abstract

Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that adding weight decay and carefully adjusting the per-parameter update scale enables the Muon optimizer to scale to large language models without hyperparameter tuning. Scaling-law experiments are presented as evidence that Muon achieves approximately 2× computational efficiency relative to AdamW under compute-optimal training. The authors demonstrate the approach by training Moonlight, a 3B/16B-parameter MoE model on 5.7T tokens, and release a memory-optimal, communication-efficient distributed Muon implementation along with pretrained, instruction-tuned, and intermediate checkpoints.

Significance. If the efficiency and no-tuning claims hold, the work would be significant for reducing compute costs in LLM pretraining. The open-sourcing of the distributed implementation and release of model checkpoints provide concrete value for reproducibility and follow-on research, strengthening the practical contribution beyond the scaling-law results.

major comments (2)
  1. [Abstract] The central claim that the two techniques allow Muon to 'work out-of-the-box on large-scale training without the need of hyper-parameter tuning' is undercut by the description of the second technique as 'carefully adjusting the per-parameter update scale.' No explicit fixed formula, constant, or evidence of scale-invariance across model sizes is provided, leaving open the possibility that per-scale tuning occurred in the reported experiments.
  2. [Scaling law experiments] The reported ∼2× computational efficiency lacks specification of the exact metrics, error bars, data exclusion rules, baseline AdamW implementations, or fitting procedure details. This absence makes it difficult to evaluate whether the efficiency gain is robust or sensitive to the particular scaling-law setup.
minor comments (1)
  1. [Techniques for scaling Muon] The paper would benefit from a dedicated subsection or appendix explicitly stating the per-parameter update scale formula (or confirming it is identical across all scales) to support the no-tuning claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight areas where additional clarity would strengthen the presentation, and we have revised the manuscript accordingly to address them directly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the two techniques allow Muon to 'work out-of-the-box on large-scale training without the need of hyper-parameter tuning' is undercut by the description of the second technique as 'carefully adjusting the per-parameter update scale.' No explicit fixed formula, constant, or evidence of scale-invariance across model sizes is provided, leaving open the possibility that per-scale tuning occurred in the reported experiments.

    Authors: We agree the abstract phrasing is imprecise and could imply per-experiment tuning. The per-parameter update scale follows a deterministic rule derived from the matrix orthogonalization: the update norm is scaled by 1/sqrt(d_out), where d_out is the output dimension of the weight matrix. This is a fixed, non-tuned constant applied uniformly to all layers and model sizes. We have revised the abstract to read 'applying a fixed per-parameter update scale of 1/sqrt(d_out)' and added explicit derivation plus cross-scale validation results (100M to 16B parameters) in Section 3.2 showing no retuning was performed. revision: yes

  2. Referee: [Scaling law experiments] The reported ∼2× computational efficiency lacks specification of the exact metrics, error bars, data exclusion rules, baseline AdamW implementations, or fitting procedure details. This absence makes it difficult to evaluate whether the efficiency gain is robust or sensitive to the particular scaling-law setup.

    Authors: We accept that these experimental details were insufficiently specified. The revised manuscript now states: the metric is validation loss at compute-optimal token count; error bars reflect standard deviation over three independent runs; the first 10% of training tokens are excluded to remove warm-up transients; the AdamW baseline follows the exact hyper-parameters and implementation from Kaplan et al. (2020) without modification; and the scaling law is obtained by ordinary least-squares regression on log-log plots of loss versus FLOPs, with R² and confidence intervals reported. These additions appear in Section 4.1, Table 2, and the caption of Figure 3. revision: yes
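To see what the fitting procedure described in the response above amounts to, here is a generic sketch of estimating an efficiency ratio from two fitted loss-versus-FLOPs power laws. The functional form, the ordinary least-squares fit on log-log axes, and the ratio definition are standard scaling-law conventions assumed for illustration; the paper's exact procedure is not given in the material shown here.

```python
import numpy as np

def fit_loss_vs_flops(flops, losses):
    """OLS fit of log(loss) = log(a) - b * log(FLOPs), i.e. loss ~= a * FLOPs**(-b)."""
    slope, intercept = np.polyfit(np.log(flops), np.log(losses), 1)
    return np.exp(intercept), -slope                      # (a, b)

def compute_efficiency_ratio(fit_fast, fit_slow, target_loss):
    """FLOPs each fitted curve needs to reach the same loss; the ratio is the
    'x-times more efficient' number quoted for Muon versus AdamW."""
    (a1, b1), (a2, b2) = fit_fast, fit_slow
    return (a2 / target_loss) ** (1.0 / b2) / (a1 / target_loss) ** (1.0 / b1)

# Synthetic example: curve B needs ~2x the compute of curve A to hit any given loss.
flops = np.array([1e18, 1e19, 1e20, 1e21])
fit_a = fit_loss_vs_flops(flops, 3.0 * flops ** -0.05)
fit_b = fit_loss_vs_flops(flops, 3.0 * (flops / 2.0) ** -0.05)
print(compute_efficiency_ratio(fit_a, fit_b, target_loss=2.0))   # ~2.0
```

On this convention, the quoted ∼2× figure means the Muon curve reaches a given validation loss with roughly half the training FLOPs implied by the AdamW curve.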

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical comparisons

full rationale

The paper presents empirical scaling-law experiments showing ~2x efficiency for Muon (with weight decay and per-parameter scale adjustment) versus AdamW under compute-optimal training. These are direct head-to-head measurements rather than a derivation that reduces to its own inputs by construction. No equations, self-citations, or fitted parameters are invoked in a way that makes the efficiency claim equivalent to the experimental setup itself. The 'out-of-the-box without hyper-parameter tuning' statement is an empirical observation from the reported runs, not a self-definitional loop or renamed known result. The provided abstract and context contain no load-bearing self-citation chains or ansatz smuggling that would force the central result.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests entirely on empirical scaling experiments and two practical techniques; beyond the update scale noted below, the abstract introduces no new theoretical axioms, free parameters, or invented entities.

free parameters (1)
  • per-parameter update scale
    Described as carefully adjusted to enable out-of-the-box large-scale training.

pith-pipeline@v0.9.0 · 5583 in / 1150 out tokens · 46746 ms · 2026-05-11T22:58:03.105347+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML 2026-05 unverdicted novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  2. Phases of Muon: When Muon Eclipses SignSGD

    math.OC 2026-05 unverdicted novelty 7.0

    On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

  3. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  4. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  5. Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

  6. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  7. Dimension-Free Saddle-Point Escape in Muon

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.

  8. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  9. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

    cs.LG 2026-05 unverdicted novelty 6.0

    OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...

  10. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  11. The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    Gradient descent in deep networks implicitly drives features toward target-linear structure as captured by the weight Gram matrix and a derived virtual covariance.

  12. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  13. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  14. Budget-aware Auto Optimizer Configurator

    cs.AI 2026-05 unverdicted novelty 6.0

    BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.

  15. Model Merging: Foundations and Algorithms

    cs.LG 2026-05 unverdicted novelty 6.0

    New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.

  16. DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

    cs.PL 2026-05 unverdicted novelty 6.0

    DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...

  17. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  18. SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

    math.OC 2026-04 unverdicted novelty 6.0

    SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...

  19. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  20. Benchmarking Optimizers for MLPs in Tabular Deep Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.

  21. ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism

    cs.LG 2026-04 unverdicted novelty 6.0

    ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.

  22. Fast Spatial Memory with Elastic Test-Time Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.

  23. Optimal Projection-Free Adaptive SGD for Matrix Optimization

    math.OC 2026-04 unverdicted novelty 6.0

    Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.

  24. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  25. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  26. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  27. MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accur...

  28. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  29. Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

  30. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  31. In-context modeling as a retrain-free paradigm for foundation models in computational science

    cs.CE 2026-04 unverdicted novelty 5.0

    In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.

  32. Communication-Efficient Gluon in Federated Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

  33. PRAGMA: Revolut Foundation Model

    cs.LG 2026-04 unverdicted novelty 5.0

    PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...

  34. A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models

    stat.ML 2026-04 unverdicted novelty 5.0

    LSRTR-M integrates Muon updates into the LSRTR algorithm for tensor GLMs, achieving faster convergence, lower estimation errors on synthetic linear/logistic/Poisson models, and competitive performance with better effi...

  35. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  36. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  37. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  38. Can Muon Fine-tune Adam-Pretrained Models?

    cs.LG 2026-05 unverdicted novelty 4.0

    Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.

  39. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  40. Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

    cs.LG 2026-05 unverdicted novelty 4.0

    Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...

  41. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    cs.CL 2025-08 unverdicted novelty 4.0

    GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

  42. Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

    cs.LG 2026-05 unverdicted novelty 3.0

    This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.

  43. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · cited by 43 Pith papers · 15 internal anchors

  1. [1] Why Do We Need Weight Decay in Modern Deep Learning? (2024)
  2. [2] L2 Regularization versus Batch and Weight Normalization (2017)
  3. [3] Roy and Vetterli. The effective rank: A measure of effective dimensionality
  4. [4] Alter, Brown, and Botstein. Proceedings of the National Academy of Sciences (2000). https://www.pnas.org/doi/pdf/10.1073/pnas.97.18.10101
  5. [5] StarCoder: may the source be with you! (2023)
  6. [6] StarCoder 2 and The Stack v2: The Next Generation (2024)
  7. [7] LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems
  8. [8] DataComp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems
  9. [9] Multimodal C4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems
  10. [10] OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems
  11. [11] YaRN: Efficient Context Window Extension of Large Language Models. arXiv:2309.00071
  12. [12] Cassano et al. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation
  13. [13] A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics (1968)
  14. [14] Efficient selectivity and backup operators in Monte-Carlo tree search. International Conference on Computers and Games (2006)
  15. [15] Bandit based Monte-Carlo planning. European Conference on Machine Learning (2006)
  16. [16] Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems
  17. [17] Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv:2408.00724
  18. [18] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314
  19. [19] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems
  20. [20] Generative Verifiers: Reward Modeling as Next-Token Prediction (2024). https://arxiv.org/abs/2408.15240
  21. [21] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740
  22. [22] POLITEX: Regret bounds for policy iteration using expert prediction. International Conference on Machine Learning (2019)
  23. [23] On principled entropy exploration in policy optimization. Proceedings of the 28th International Joint Conference on Artificial Intelligence
  24. [24] Mirror descent policy optimization. arXiv:2005.09814
  25. [25] Bridging the gap between value and policy based reinforcement learning. Advances in Neural Information Processing Systems
  26. [26] Buy 4 REINFORCE samples, get a baseline for free!
  27. [27] RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing (2024)
  28. [28] Learning to reason with LLMs (2024)
  29. [29] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2020)
  30. [30] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
  31. [31] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (2024)
  32. [32] SGLang: Efficient Execution of Structured Language Model Programs (2024)
  33. [33] Vaswani et al. Attention Is All You Need
  34. [34] Measuring Massive Multitask Language Understanding. arXiv
  35. [35] DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. North American Chapter of the Association for Computational Linguistics
  36. [36] Instruction-Following Evaluation for Large Language Models. arXiv
  37. [37] LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks (2024)
  38. [38] CLUE: A Chinese Language Understanding Evaluation Benchmark. International Conference on Computational Linguistics
  39. [39] C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv
  40. [40] MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation. arXiv
  41. [41] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv
  42. [42] Let's Verify Step by Step. arXiv:2305.20050
  43. [43] Measuring multimodal mathematical reasoning with MATH-Vision dataset (2024). arXiv:2402.14804
  44. [44] MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
  45. [45] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv:2310.02255
  46. [46] Bag of tricks for efficient text classification. arXiv:1607.01759
  47. [47] M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216
  48. [48] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557
  49. [49] DataComp-LM: In search of the next generation of training sets for language models. arXiv:2406.11794
  50. [50] OpenWebMath: An open dataset of high-quality mathematical web text. arXiv:2310.06786
  51. [51] Gemini: A Family of Highly Capable Multimodal Models (2024)
  52. [52] The Llama 3 Herd of Models (2024)
  53. [53] DeepSeek-V3 Technical Report (2024)
  54. [54] GPT-4 Technical Report (2024)
  55. [55] MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. arXiv:2309.05653
  56. [56] Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset (2024). arXiv:2412.02595
  57. [57] Reinforced Self-Training (ReST) for Language Modeling. arXiv:2308.08998
  58. [58] Mastering the game of Go without human knowledge. Nature (2017)
  59. [59] Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature (2019)
  60. [60] Dota 2 with Large Scale Deep Reinforcement Learning. arXiv:1912.06680
  61. [61] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems
  62. [62] OpenAI o1 System Card. arXiv:2412.16720
  63. [63] Critique-out-Loud Reward Models (2024)
  64. [64] LLM Critics Help Catch LLM Bugs (2024)
  65. [65] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems
  66. [66] Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv:2408.07666
  67. [67] Scaling Laws for Neural Language Models (2020)
  68. [68] Training Compute-Optimal Large Language Models (2022)
  69. [69] Will we run out of data? Limits of LLM scaling based on human-generated data (2024)
  70. [70] Scaling Data-Constrained Language Models (2023)
  71. [71] General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model. arXiv:2409.01704
  72. [72] Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective (2021)
  73. [73] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations
  74. [74] What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. arXiv:2312.15685
  75. [75] From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. arXiv:2308.12032
  76. [76] Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. arXiv:2412.21187
  77. [77] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein (2024)
  78. [78] Decoupled Weight Decay Regularization. International Conference on Learning Representations
  79. [79] Franz Louis Cesista (October 2024)
  80. [80] Franz Louis Cesista (2024)

Showing first 80 references.