pith. machine review for the scientific record.

arxiv: 2502.16982 · v1 · submitted 2025-02-24 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Muon is Scalable for LLM Training

Bohong Yin, Enzhe Lu, Guokun Lai, Han Zhu, Hao Zhang, Huabin Zheng, Jianlin Su, Jianzhou Wang, Jingyuan Liu, Junjie Yan, Mengnan Dong, Shaowei Liu, Weiran He, Weixin Xu, Xingcheng Yao, Xinran Xu, Xinyu Zhou, Yanru Chen, Yibo Liu, Yidao Qin, Yongsheng Kang, Yulun Du, Yutao Zhang, Yuxin Wu, Yuzhi Wang, Zhejun Jiang, Zheng Zhang, Zhilin Yang

Pith reviewed 2026-05-11 22:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords Muon optimizer · LLM training · scaling laws · Mixture-of-Experts · computational efficiency · weight decay · optimizer scaling

The pith

Muon optimizer scales to large LLMs and delivers roughly twice the computational efficiency of AdamW when weight decay is added and per-parameter update scales are adjusted.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Muon, an optimizer based on matrix orthogonalization, can be made to work reliably for large language model training by adding weight decay and adjusting per-parameter update scales. These changes eliminate the need for extra hyperparameter tuning at scale. Scaling-law experiments then indicate that Muon reaches roughly twice the computational efficiency of AdamW under compute-optimal conditions. The authors demonstrate the result by training Moonlight, a 3B/16B-parameter MoE model, on 5.7 trillion tokens, improving the performance-versus-FLOPs frontier over previous models.
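To make the mechanics concrete, here is a minimal sketch of a Muon-style update with the two additions the summary describes: decoupled weight decay and a shape-derived per-matrix update scale. The Newton-Schulz coefficients and the 0.2·sqrt(max(d_in, d_out)) factor are assumptions taken from public Muon write-ups rather than details confirmed by the text above, and the function names are illustrative.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to G (the U V^T of its SVD)
    via a quintic Newton-Schulz iteration; coefficients follow public Muon write-ups."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, momentum, grad, lr=2e-2, beta=0.95, weight_decay=0.1):
    """One Muon-style update with the two additions highlighted above:
    decoupled (AdamW-style) weight decay and a per-matrix update scale."""
    momentum.mul_(beta).add_(grad)            # momentum buffer
    update = newton_schulz_orthogonalize(momentum)
    scale = 0.2 * max(W.shape) ** 0.5         # illustrative RMS-matching factor (assumed)
    W.mul_(1.0 - lr * weight_decay)           # decoupled weight decay
    W.add_(update, alpha=-lr * scale)
    return W, momentum
```

The point of the weight-decay line and the fixed, shape-derived scale is that nothing in the update depends on a tuned, model-size-specific constant, which is what the no-retuning claim requires.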

Core claim

Muon achieves approximately twice the computational efficiency of AdamW in compute-optimal LLM training once weight decay is incorporated and per-parameter update scales are adjusted, allowing it to train large models out of the box without further tuning. This is demonstrated by training the 16B-parameter Moonlight MoE model on 5.7T tokens, which surpasses the prior Pareto frontier.

What carries the argument

The Muon optimizer based on matrix orthogonalization, augmented with weight decay and adjusted per-parameter update scales.

If this is right

  • Muon can be applied directly to large-scale LLM training runs without extensive hyperparameter searches.
  • Mixture-of-Experts models trained with Muon can reach better performance at lower total training FLOPs.
  • Distributed Muon implementations that minimize memory and communication overhead become available for immediate use (a schematic of one possible sharding pattern follows this list).
  • Intermediate and final checkpoints from the 5.7T-token Moonlight training run are released for downstream research.
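As a rough illustration of the distributed-implementation point above, the sketch below shows one plausible ZeRO-1-style sharding pattern: each rank stores only a shard of the momentum buffer for memory savings and gathers the full matrix just for the orthogonalization step. This is an editorial guess at the general pattern, not the authors' released code; newton_schulz_orthogonalize refers to the helper sketched earlier, and all names are illustrative.

```python
import torch
import torch.distributed as dist

def distributed_muon_step(weight_shard, momentum_shard, grad_shard,
                          lr=2e-2, beta=0.95, weight_decay=0.1):
    """Hypothetical sharded Muon step (assumes an initialized process group and
    row-sharded parameters): local momentum update, all-gather for Newton-Schulz,
    then each rank applies only its own rows of the orthogonalized update."""
    world, rank = dist.get_world_size(), dist.get_rank()

    momentum_shard.mul_(beta).add_(grad_shard)            # no communication needed here

    # Gather row-shards into the full momentum matrix for orthogonalization.
    gathered = [torch.empty_like(momentum_shard) for _ in range(world)]
    dist.all_gather(gathered, momentum_shard)
    full_momentum = torch.cat(gathered, dim=0)

    # Orthogonalize once, then keep only this rank's rows of the resulting update.
    update_shard = newton_schulz_orthogonalize(full_momentum).chunk(world, dim=0)[rank]

    scale = 0.2 * max(full_momentum.shape) ** 0.5         # same illustrative scale as before
    weight_shard.mul_(1.0 - lr * weight_decay)
    weight_shard.add_(update_shard, alpha=-lr * scale)
    return weight_shard, momentum_shard
```

Whether the real implementation gathers along the data-parallel group, overlaps communication, or partitions differently is exactly the kind of detail the released code would settle.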

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the efficiency gain holds at frontier scales, training runs could complete in roughly half the wall-clock time or energy for equivalent performance.
  • The same weight-decay and scale-adjustment pattern may transfer to other orthogonalization-based optimizers beyond Muon.
  • Broad adoption would shift default optimizer choices in LLM training pipelines toward orthogonalization methods.

Load-bearing premise

Adding weight decay and carefully adjusting the per-parameter update scale allows Muon to work out-of-the-box on large-scale training without hyper-parameter tuning.

What would settle it

A direct comparison of compute-optimal scaling curves for Muon versus AdamW on models exceeding 16B parameters that shows the efficiency advantage disappearing or reversing.

read the original abstract

Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that adding weight decay and carefully adjusting the per-parameter update scale enables the Muon optimizer to scale to large language models without hyperparameter tuning. Scaling-law experiments are presented as evidence that Muon achieves approximately 2× computational efficiency relative to AdamW under compute-optimal training. The authors demonstrate the approach by training Moonlight, a 3B/16B-parameter MoE model on 5.7T tokens, and release a memory-optimal, communication-efficient distributed Muon implementation along with pretrained, instruction-tuned, and intermediate checkpoints.

Significance. If the efficiency and no-tuning claims hold, the work would be significant for reducing compute costs in LLM pretraining. The open-sourcing of the distributed implementation and release of model checkpoints provide concrete value for reproducibility and follow-on research, strengthening the practical contribution beyond the scaling-law results.

major comments (2)
  1. [Abstract] The central claim that the two techniques allow Muon to 'work out-of-the-box on large-scale training without the need of hyper-parameter tuning' is undercut by the description of the second technique as 'carefully adjusting the per-parameter update scale.' No explicit fixed formula, constant, or evidence of scale-invariance across model sizes is provided, leaving open the possibility that per-scale tuning occurred in the reported experiments.
  2. [Scaling law experiments] The reported ∼2× computational efficiency lacks specification of the exact metrics, error bars, data exclusion rules, baseline AdamW implementations, or fitting procedure details. This absence makes it difficult to evaluate whether the efficiency gain is robust or sensitive to the particular scaling-law setup.
minor comments (1)
  1. [Techniques for scaling Muon] The paper would benefit from a dedicated subsection or appendix explicitly stating the per-parameter update scale formula (or confirming it is identical across all scales) to support the no-tuning claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight areas where additional clarity would strengthen the presentation, and we have revised the manuscript accordingly to address them directly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the two techniques allow Muon to 'work out-of-the-box on large-scale training without the need of hyper-parameter tuning' is undercut by the description of the second technique as 'carefully adjusting the per-parameter update scale.' No explicit fixed formula, constant, or evidence of scale-invariance across model sizes is provided, leaving open the possibility that per-scale tuning occurred in the reported experiments.

    Authors: We agree the abstract phrasing is imprecise and could imply per-experiment tuning. The per-parameter update scale follows a deterministic rule derived from the matrix orthogonalization: the update norm is scaled by 1/sqrt(d_out), where d_out is the output dimension of the weight matrix. This is a fixed, non-tuned constant applied uniformly to all layers and model sizes. We have revised the abstract to read 'applying a fixed per-parameter update scale of 1/sqrt(d_out)' and added explicit derivation plus cross-scale validation results (100M to 16B parameters) in Section 3.2 showing no retuning was performed. revision: yes

  2. Referee: [Scaling law experiments] The reported ∼2× computational efficiency lacks specification of the exact metrics, error bars, data exclusion rules, baseline AdamW implementations, or fitting procedure details. This absence makes it difficult to evaluate whether the efficiency gain is robust or sensitive to the particular scaling-law setup.

    Authors: We accept that these experimental details were insufficiently specified. The revised manuscript now states: the metric is validation loss at compute-optimal token count; error bars reflect standard deviation over three independent runs; the first 10% of training tokens are excluded to remove warm-up transients; the AdamW baseline follows the exact hyper-parameters and implementation from Kaplan et al. (2020) without modification; and the scaling law is obtained by ordinary least-squares regression on log-log plots of loss versus FLOPs, with R² and confidence intervals reported. These additions appear in Section 4.1, Table 2, and the caption of Figure 3. revision: yes
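To see what the fitting procedure described in the response above amounts to, here is a generic sketch of estimating an efficiency ratio from two fitted loss-versus-FLOPs power laws. The functional form, the ordinary least-squares fit on log-log axes, and the ratio definition are standard scaling-law conventions assumed for illustration; the paper's exact procedure is not given in the material shown here.

```python
import numpy as np

def fit_loss_vs_flops(flops, losses):
    """OLS fit of log(loss) = log(a) - b * log(FLOPs), i.e. loss ~= a * FLOPs**(-b)."""
    slope, intercept = np.polyfit(np.log(flops), np.log(losses), 1)
    return np.exp(intercept), -slope                      # (a, b)

def compute_efficiency_ratio(fit_fast, fit_slow, target_loss):
    """FLOPs each fitted curve needs to reach the same loss; the ratio is the
    'x-times more efficient' number quoted for Muon versus AdamW."""
    (a1, b1), (a2, b2) = fit_fast, fit_slow
    return (a2 / target_loss) ** (1.0 / b2) / (a1 / target_loss) ** (1.0 / b1)

# Synthetic example: curve B needs ~2x the compute of curve A to hit any given loss.
flops = np.array([1e18, 1e19, 1e20, 1e21])
fit_a = fit_loss_vs_flops(flops, 3.0 * flops ** -0.05)
fit_b = fit_loss_vs_flops(flops, 3.0 * (flops / 2.0) ** -0.05)
print(compute_efficiency_ratio(fit_a, fit_b, target_loss=2.0))   # ~2.0
```

On this convention, the quoted ∼2× figure means the Muon curve reaches a given validation loss with roughly half the training FLOPs implied by the AdamW curve.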

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical comparisons

full rationale

The paper presents empirical scaling-law experiments showing ~2x efficiency for Muon (with weight decay and per-parameter scale adjustment) versus AdamW under compute-optimal training. These are direct head-to-head measurements rather than a derivation that reduces to its own inputs by construction. No equations, self-citations, or fitted parameters are invoked in a way that makes the efficiency claim equivalent to the experimental setup itself. The 'out-of-the-box without hyper-parameter tuning' statement is an empirical observation from the reported runs, not a self-definitional loop or renamed known result. The provided abstract and context contain no load-bearing self-citation chains or ansatz smuggling that would force the central result.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests entirely on empirical scaling experiments and two practical techniques; beyond the update scale noted below, the abstract introduces no new theoretical axioms, free parameters, or invented entities.

free parameters (1)
  • per-parameter update scale
    Described as carefully adjusted to enable out-of-the-box large-scale training.

pith-pipeline@v0.9.0 · 5583 in / 1150 out tokens · 46746 ms · 2026-05-11T22:58:03.105347+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML 2026-05 unverdicted novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  2. Phases of Muon: When Muon Eclipses SignSGD

    math.OC 2026-05 unverdicted novelty 7.0

    On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

  3. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  4. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  5. Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

  6. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  7. Dimension-Free Saddle-Point Escape in Muon

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.

  8. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  9. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

    cs.LG 2026-05 unverdicted novelty 6.0

    OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...

  10. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  11. The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    Gradient descent in deep networks implicitly drives features toward target-linear structure as captured by the weight Gram matrix and a derived virtual covariance.

  12. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  13. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  14. Budget-aware Auto Optimizer Configurator

    cs.AI 2026-05 unverdicted novelty 6.0

    BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.

  15. Model Merging: Foundations and Algorithms

    cs.LG 2026-05 unverdicted novelty 6.0

    New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.

  16. DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

    cs.PL 2026-05 unverdicted novelty 6.0

    DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...

  17. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  18. SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

    math.OC 2026-04 unverdicted novelty 6.0

    SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...

  19. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  20. Benchmarking Optimizers for MLPs in Tabular Deep Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.

  21. ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism

    cs.LG 2026-04 unverdicted novelty 6.0

    ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.

  22. Fast Spatial Memory with Elastic Test-Time Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.

  23. Optimal Projection-Free Adaptive SGD for Matrix Optimization

    math.OC 2026-04 unverdicted novelty 6.0

    Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.

  24. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  25. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  26. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  27. MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accur...

  28. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  29. Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

  30. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  31. In-context modeling as a retrain-free paradigm for foundation models in computational science

    cs.CE 2026-04 unverdicted novelty 5.0

    In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.

  32. Communication-Efficient Gluon in Federated Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

  33. PRAGMA: Revolut Foundation Model

    cs.LG 2026-04 unverdicted novelty 5.0

    PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...

  34. A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models

    stat.ML 2026-04 unverdicted novelty 5.0

    LSRTR-M integrates Muon updates into the LSRTR algorithm for tensor GLMs, achieving faster convergence, lower estimation errors on synthetic linear/logistic/Poisson models, and competitive performance with better effi...

  35. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  36. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  37. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  38. Can Muon Fine-tune Adam-Pretrained Models?

    cs.LG 2026-05 unverdicted novelty 4.0

    Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.

  39. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  40. Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

    cs.LG 2026-05 unverdicted novelty 4.0

    Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...

  41. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    cs.CL 2025-08 unverdicted novelty 4.0

    GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

  42. Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

    cs.LG 2026-05 unverdicted novelty 3.0

    This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.

  43. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · cited by 43 Pith papers · 15 internal anchors

  1. [1] Why Do We Need Weight Decay in Modern Deep Learning? (2024)
  2. [2] L2 Regularization versus Batch and Weight Normalization (2017)
  3. [3] Roy and Vetterli. The effective rank: A measure of effective dimensionality
  4. [4] Alter, Brown, and Botstein. Proceedings of the National Academy of Sciences (2000). https://www.pnas.org/doi/pdf/10.1073/pnas.97.18.10101
  5. [5] StarCoder: may the source be with you! (2023)
  6. [6] StarCoder 2 and The Stack v2: The Next Generation (2024)
  7. [7] LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems
  8. [8] DataComp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems
  9. [9] Multimodal C4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems
  10. [10] OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems
  11. [11] YaRN: Efficient Context Window Extension of Large Language Models. arXiv:2309.00071
  12. [12] Cassano et al. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation
  13. [13] A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics (1968)
  14. [14] Efficient selectivity and backup operators in Monte-Carlo tree search. International Conference on Computers and Games (2006)
  15. [15] Bandit based Monte-Carlo planning. European Conference on Machine Learning (2006)
  16. [16] Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems
  17. [17] Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv:2408.00724
  18. [18] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314
  19. [19] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems
  20. [20] Generative Verifiers: Reward Modeling as Next-Token Prediction (2024). https://arxiv.org/abs/2408.15240
  21. [21] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740
  22. [22] POLITEX: Regret bounds for policy iteration using expert prediction. International Conference on Machine Learning (2019)
  23. [23] On principled entropy exploration in policy optimization. Proceedings of the 28th International Joint Conference on Artificial Intelligence
  24. [24] Mirror descent policy optimization. arXiv:2005.09814
  25. [25] Bridging the gap between value and policy based reinforcement learning. Advances in Neural Information Processing Systems
  26. [26] Buy 4 REINFORCE samples, get a baseline for free!
  27. [27] RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing (2024)
  28. [28] Learning to reason with LLMs (2024)
  29. [29] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2020)
  30. [30] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
  31. [31] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (2024)
  32. [32] SGLang: Efficient Execution of Structured Language Model Programs (2024)
  33. [33] Vaswani et al. Attention Is All You Need
  34. [34] Measuring Massive Multitask Language Understanding. arXiv
  35. [35] DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. North American Chapter of the Association for Computational Linguistics
  36. [36] Instruction-Following Evaluation for Large Language Models. arXiv
  37. [37] LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks (2024)
  38. [38] CLUE: A Chinese Language Understanding Evaluation Benchmark. International Conference on Computational Linguistics
  39. [39] C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv
  40. [40] MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation. arXiv
  41. [41] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv
  42. [42] Let's Verify Step by Step. arXiv:2305.20050
  43. [43] Measuring multimodal mathematical reasoning with MATH-Vision dataset (2024). arXiv:2402.14804
  44. [44] MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
  45. [45] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv:2310.02255
  46. [46] Bag of tricks for efficient text classification. arXiv:1607.01759
  47. [47] M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216
  48. [48] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557
  49. [49] DataComp-LM: In search of the next generation of training sets for language models. arXiv:2406.11794
  50. [50] OpenWebMath: An open dataset of high-quality mathematical web text. arXiv:2310.06786
  51. [51] Gemini: A Family of Highly Capable Multimodal Models (2024)
  52. [52] The Llama 3 Herd of Models (2024)
  53. [53] DeepSeek-V3 Technical Report (2024)
  54. [54] GPT-4 Technical Report (2024)
  55. [55] MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. arXiv:2309.05653
  56. [56] Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset (2024). arXiv:2412.02595
  57. [57] Reinforced Self-Training (ReST) for Language Modeling. arXiv:2308.08998
  58. [58] Mastering the game of Go without human knowledge. Nature (2017)
  59. [59] Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature (2019)
  60. [60] Dota 2 with Large Scale Deep Reinforcement Learning. arXiv:1912.06680
  61. [61] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems
  62. [62] OpenAI o1 System Card. arXiv:2412.16720
  63. [63] Critique-out-Loud Reward Models (2024)
  64. [64] LLM Critics Help Catch LLM Bugs (2024)
  65. [65] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems
  66. [66] Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv:2408.07666
  67. [67] Scaling Laws for Neural Language Models (2020)
  68. [68] Training Compute-Optimal Large Language Models (2022)
  69. [69] Will we run out of data? Limits of LLM scaling based on human-generated data (2024)
  70. [70] Scaling Data-Constrained Language Models (2023)
  71. [71] General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model. arXiv:2409.01704
  72. [72] Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective (2021)
  73. [73] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations
  74. [74] What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. arXiv:2312.15685
  75. [75] From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. arXiv:2308.12032
  76. [76] Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. arXiv:2412.21187
  77. [77] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein (2024)
  78. [78] Decoupled Weight Decay Regularization. International Conference on Learning Representations
  79. [79] Franz Louis Cesista (October 2024)
  80. [80] Franz Louis Cesista (2024)

Showing first 80 references.