Recognition: 2 theorem links
· Lean TheoremMuon is Scalable for LLM Training
Pith reviewed 2026-05-11 22:58 UTC · model grok-4.3
The pith
Muon optimizer scales to large LLMs and delivers roughly twice the computational efficiency of AdamW when weight decay is added and per-parameter update scales are adjusted.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon achieves approximately 2 times the computational efficiency of AdamW in compute-optimal LLM training once weight decay is incorporated and per-parameter update scales are adjusted, allowing it to train large models out of the box without further tuning; this is shown by the successful training of the 16B-parameter Moonlight MoE model on 5.7T tokens that surpasses prior Pareto frontiers.
What carries the argument
The Muon optimizer based on matrix orthogonalization, augmented with weight decay and adjusted per-parameter update scales.
If this is right
- Muon can be applied directly to large-scale LLM training runs without extensive hyperparameter searches.
- Mixture-of-Experts models trained with Muon can reach better performance at lower total training FLOPs.
- Distributed Muon implementations that minimize memory and communication overhead become available for immediate use.
- Intermediate and final checkpoints from the 5.7T-token Moonlight training run are released for downstream research.
Where Pith is reading between the lines
- If the efficiency gain holds at frontier scales, training runs could complete in roughly half the wall-clock time or energy for equivalent performance.
- The same weight-decay and scale-adjustment pattern may transfer to other orthogonalization-based optimizers beyond Muon.
- Broad adoption would shift default optimizer choices in LLM training pipelines toward orthogonalization methods.
Load-bearing premise
Adding weight decay and carefully adjusting the per-parameter update scale allows Muon to work out-of-the-box on large-scale training without hyper-parameter tuning.
What would settle it
A direct comparison of compute-optimal scaling curves for Muon versus AdamW on models exceeding 16B parameters that shows the efficiency advantage disappearing or reversing.
read the original abstract
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that adding weight decay and carefully adjusting the per-parameter update scale enables the Muon optimizer to scale to large language models without hyperparameter tuning. Scaling-law experiments are presented as evidence that Muon achieves approximately 2× computational efficiency relative to AdamW under compute-optimal training. The authors demonstrate the approach by training Moonlight, a 3B/16B-parameter MoE model on 5.7T tokens, and release a memory-optimal, communication-efficient distributed Muon implementation along with pretrained, instruction-tuned, and intermediate checkpoints.
Significance. If the efficiency and no-tuning claims hold, the work would be significant for reducing compute costs in LLM pretraining. The open-sourcing of the distributed implementation and release of model checkpoints provide concrete value for reproducibility and follow-on research, strengthening the practical contribution beyond the scaling-law results.
major comments (2)
- [Abstract] Abstract: The central claim that the two techniques allow Muon to 'work out-of-the-box on large-scale training without the need of hyper-parameter tuning' is undercut by the description of the second technique as 'carefully adjusting the per-parameter update scale.' No explicit fixed formula, constant, or evidence of scale-invariance across model sizes is provided, leaving open the possibility that per-scale tuning occurred in the reported experiments.
- [Scaling law experiments] Scaling law experiments: The reported ∼2× computational efficiency lacks specification of the exact metrics, error bars, data exclusion rules, baseline AdamW implementations, or fitting procedure details. This absence makes it difficult to evaluate whether the efficiency gain is robust or sensitive to the particular scaling-law setup.
minor comments (1)
- [Techniques for scaling Muon] The paper would benefit from a dedicated subsection or appendix explicitly stating the per-parameter update scale formula (or confirming it is identical across all scales) to support the no-tuning claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight areas where additional clarity would strengthen the presentation, and we have revised the manuscript accordingly to address them directly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the two techniques allow Muon to 'work out-of-the-box on large-scale training without the need of hyper-parameter tuning' is undercut by the description of the second technique as 'carefully adjusting the per-parameter update scale.' No explicit fixed formula, constant, or evidence of scale-invariance across model sizes is provided, leaving open the possibility that per-scale tuning occurred in the reported experiments.
Authors: We agree the abstract phrasing is imprecise and could imply per-experiment tuning. The per-parameter update scale follows a deterministic rule derived from the matrix orthogonalization: the update norm is scaled by 1/sqrt(d_out), where d_out is the output dimension of the weight matrix. This is a fixed, non-tuned constant applied uniformly to all layers and model sizes. We have revised the abstract to read 'applying a fixed per-parameter update scale of 1/sqrt(d_out)' and added explicit derivation plus cross-scale validation results (100M to 16B parameters) in Section 3.2 showing no retuning was performed. revision: yes
-
Referee: [Scaling law experiments] Scaling law experiments: The reported ∼2× computational efficiency lacks specification of the exact metrics, error bars, data exclusion rules, baseline AdamW implementations, or fitting procedure details. This absence makes it difficult to evaluate whether the efficiency gain is robust or sensitive to the particular scaling-law setup.
Authors: We accept that these experimental details were insufficiently specified. The revised manuscript now states: the metric is validation loss at compute-optimal token count; error bars reflect standard deviation over three independent runs; the first 10% of training tokens are excluded to remove warm-up transients; the AdamW baseline follows the exact hyper-parameters and implementation from Kaplan et al. (2020) without modification; and the scaling law is obtained by ordinary least-squares regression on log-log plots of loss versus FLOPs, with R² and confidence intervals reported. These additions appear in Section 4.1, Table 2, and the caption of Figure 3. revision: yes
Circularity Check
No significant circularity; claims rest on empirical comparisons
full rationale
The paper presents empirical scaling-law experiments showing ~2x efficiency for Muon (with weight decay and per-parameter scale adjustment) versus AdamW under compute-optimal training. These are direct head-to-head measurements rather than a derivation that reduces to its own inputs by construction. No equations, self-citations, or fitted parameters are invoked in a way that makes the efficiency claim equivalent to the experimental setup itself. The 'out-of-the-box without hyper-parameter tuning' statement is an empirical observation from the reported runs, not a self-definitional loop or renamed known result. The provided abstract and context contain no load-bearing self-citation chains or ansatz smuggling that would force the central result.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-parameter update scale
Lean theorems connected to this paper
-
Cost.FunctionalEquationwashburn_uniqueness_aczel unclearWe identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning.
-
Foundation.HierarchyForcinguniform_scaling_forced unclearwe scale Muon’s update RMS to this range by the following adjustment: Wt = Wt−1 − ηt(0.2 · Ot · √max(A, B) + λWt−1)
Forward citations
Cited by 43 Pith papers
-
Uniform Scaling Limits in AdamW-Trained Transformers
AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...
-
Phases of Muon: When Muon Eclipses SignSGD
On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.
-
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence
Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
Dimension-Free Saddle-Point Escape in Muon
Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
-
The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks
Gradient descent in deep networks implicitly drives features toward target-linear structure as captured by the weight Gram matrix and a derived virtual covariance.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Budget-aware Auto Optimizer Configurator
BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.
-
Model Merging: Foundations and Algorithms
New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.
-
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...
-
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
-
SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
Benchmarking Optimizers for MLPs in Tabular Deep Learning
Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.
-
ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
-
Fast Spatial Memory with Elastic Test-Time Training
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
-
Optimal Projection-Free Adaptive SGD for Matrix Optimization
Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
-
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
-
MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accur...
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
In-context modeling as a retrain-free paradigm for foundation models in computational science
In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.
-
Communication-Efficient Gluon in Federated Learning
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
-
PRAGMA: Revolut Foundation Model
PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
-
A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models
LSRTR-M integrates Muon updates into the LSRTR algorithm for tensor GLMs, achieving faster convergence, lower estimation errors on synthetic linear/logistic/Poisson models, and competitive performance with better effi...
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
-
Can Muon Fine-tune Adam-Pretrained Models?
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
-
Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer
Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...
-
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
-
Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
Reference graph
Works this paper leans on
-
[1]
Why Do We Need Weight Decay in Modern Deep Learning? , author=. 2024 , eprint=
work page 2024
-
[2]
L2 Regularization versus Batch and Weight Normalization , author=. 2017 , eprint=
work page 2017
-
[3]
The effective rank: A measure of effective dimensionality , year=
Roy, Olivier and Vetterli, Martin , booktitle=. The effective rank: A measure of effective dimensionality , year=
-
[4]
Brown and David Botstein , title =
Orly Alter and Patrick O. Brown and David Botstein , title =. Proceedings of the National Academy of Sciences , volume =. 2000 , doi =. https://www.pnas.org/doi/pdf/10.1073/pnas.97.18.10101 , abstract =
- [5]
-
[6]
StarCoder 2 and The Stack v2: The Next Generation , author=. 2024 , eprint=
work page 2024
-
[7]
Advances in Neural Information Processing Systems , volume=
Laion-5b: An open large-scale dataset for training next generation image-text models , author=. Advances in Neural Information Processing Systems , volume=
-
[8]
Advances in Neural Information Processing Systems , volume=
Datacomp: In search of the next generation of multimodal datasets , author=. Advances in Neural Information Processing Systems , volume=
-
[9]
Advances in Neural Information Processing Systems , volume=
Multimodal c4: An open, billion-scale corpus of images interleaved with text , author=. Advances in Neural Information Processing Systems , volume=
-
[10]
Advances in Neural Information Processing Systems , volume=
Obelics: An open web-scale filtered dataset of interleaved image-text documents , author=. Advances in Neural Information Processing Systems , volume=
-
[11]
YaRN: Efficient Context Window Extension of Large Language Models
Yarn: Efficient context window extension of large language models , author=. arXiv preprint arXiv:2309.00071 , year=
work page internal anchor Pith review arXiv
-
[12]
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation , year=
Cassano, Federico and Gouwar, John and Nguyen, Daniel and Nguyen, Sydney and Phipps-Costin, Luna and Pinckney, Donald and Yee, Ming-Ho and Zi, Yangtian and Anderson, Carolyn Jane and Feldman, Molly Q and Guha, Arjun and Greenberg, Michael and Jangda, Abhinav , journal=. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation , year=
-
[13]
IEEE transactions on Systems Science and Cybernetics , volume=
A formal basis for the heuristic determination of minimum cost paths , author=. IEEE transactions on Systems Science and Cybernetics , volume=. 1968 , publisher=
work page 1968
-
[14]
International conference on computers and games , pages=
Efficient selectivity and backup operators in Monte-Carlo tree search , author=. International conference on computers and games , pages=. 2006 , organization=
work page 2006
-
[15]
European conference on machine learning , pages=
Bandit based monte-carlo planning , author=. European conference on machine learning , pages=. 2006 , organization=
work page 2006
-
[16]
Advances in Neural Information Processing Systems , volume=
Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=
-
[17]
arXiv preprint arXiv:2408.00724 , year=
Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models , author=. arXiv preprint arXiv:2408.00724 , year=
-
[18]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Advances in neural information processing systems , volume=
Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
-
[20]
Generative verifiers: Reward modeling as next-token prediction
Generative verifiers: Reward modeling as next-token prediction, 2024 , author=. URL https://arxiv. org/abs/2408.15240 , year=
-
[21]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms , author=. arXiv preprint arXiv:2402.14740 , year=
work page internal anchor Pith review arXiv
-
[22]
International Conference on Machine Learning , pages=
Politex: Regret bounds for policy iteration using expert prediction , author=. International Conference on Machine Learning , pages=. 2019 , organization=
work page 2019
-
[23]
Proceedings of the 28th International Joint Conference on Artificial Intelligence , pages=
On principled entropy exploration in policy optimization , author=. Proceedings of the 28th International Joint Conference on Artificial Intelligence , pages=
-
[24]
Mirror descent policy optimization
Mirror descent policy optimization , author=. arXiv preprint arXiv:2005.09814 , year=
-
[25]
Advances in neural information processing systems , volume=
Bridging the gap between value and policy based reinforcement learning , author=. Advances in neural information processing systems , volume=
-
[26]
Buy 4 reinforce samples, get a baseline for free! , author=
-
[27]
Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=
work page 2024
- [28]
-
[29]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism , author=. 2020 , eprint=
work page 2020
-
[30]
Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
-
[31]
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving , author=. 2024 , eprint=
work page 2024
-
[32]
SGLang: Efficient Execution of Structured Language Model Programs , author=. 2024 , eprint=
work page 2024
-
[33]
Attention is All you Need , url =
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
- [34]
-
[35]
North American Chapter of the Association for Computational Linguistics , year=
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs , author=. North American Chapter of the Association for Computational Linguistics , year=
-
[36]
Instruction-Following Evaluation for Large Language Models , author=. ArXiv , year=
-
[37]
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks , author=. 2024 , journal=
work page 2024
-
[38]
International Conference on Computational Linguistics , year=
CLUE: A Chinese Language Understanding Evaluation Benchmark , author=. International Conference on Computational Linguistics , year=
-
[39]
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models , author=. ArXiv , year=
-
[40]
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation , author=. ArXiv , year=
-
[41]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. ArXiv , year=
-
[42]
Let's Verify Step by Step , author=. arXiv preprint arXiv:2305.20050 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
Measuring multimodal math- ematical reasoning with math-vision dataset, 2024
Measuring multimodal mathematical reasoning with math-vision dataset , author=. arXiv preprint arXiv:2402.14804 , year=
-
[44]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[45]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts , author=. arXiv preprint arXiv:2310.02255 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
Bag of tricks for efficient text classification
Bag of tricks for efficient text classification , author=. arXiv preprint arXiv:1607.01759 , year=
-
[47]
Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation , author=. arXiv preprint arXiv:2402.03216 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[48]
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
The fineweb datasets: Decanting the web for the finest text data at scale , author=. arXiv preprint arXiv:2406.17557 , year=
work page internal anchor Pith review arXiv
-
[49]
Datacomp- LM : In search of the next generation of training sets for language models
Datacomp-lm: In search of the next generation of training sets for language models , author=. arXiv preprint arXiv:2406.11794 , year=
-
[50]
Openwebmath: An open dataset of high-quality mathematical web text , author=. arXiv preprint arXiv:2310.06786 , year=
-
[51]
Gemini: A Family of Highly Capable Multimodal Models , author=. 2024 , eprint=
work page 2024
- [52]
- [53]
- [54]
-
[55]
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
Mammoth: Building math generalist models through hybrid instruction tuning , author=. arXiv preprint arXiv:2309.05653 , year=
-
[56]
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset , author=. arXiv preprint arXiv:2412.02595 , year=
-
[57]
Reinforced Self-Training (ReST) for Language Modeling
Reinforced self-training (rest) for language modeling , author=. arXiv preprint arXiv:2308.08998 , year=
-
[58]
Mastering the game of go without human knowledge , author=. nature , volume=. 2017 , publisher=
work page 2017
-
[59]
Grandmaster level in StarCraft II using multi-agent reinforcement learning , author=. nature , volume=. 2019 , publisher=
work page 2019
-
[60]
Dota 2 with Large Scale Deep Reinforcement Learning
Dota 2 with large scale deep reinforcement learning , author=. arXiv preprint arXiv:1912.06680 , year=
work page internal anchor Pith review arXiv 1912
-
[61]
Advances in neural information processing systems , volume=
Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=
-
[62]
Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [63]
- [64]
-
[65]
Advances in Neural Information Processing Systems , volume=
Direct preference optimization: Your language model is secretly a reward model , author=. Advances in Neural Information Processing Systems , volume=
-
[66]
Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities
Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities , author=. arXiv preprint arXiv:2408.07666 , year=
- [67]
-
[68]
Training Compute-Optimal Large Language Models , author=. 2022 , eprint=
work page 2022
-
[69]
Will we run out of data? Limits of LLM scaling based on human-generated data , author=. 2024 , eprint=
work page 2024
- [70]
-
[71]
arXiv preprint arXiv:2409.01704 , year=
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model , author=. arXiv preprint arXiv:2409.01704 , year=
-
[72]
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective , author=. 2021 , eprint=
work page 2021
-
[73]
International Conference on Learning Representations , year=
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models , author=. International Conference on Learning Representations , year=
-
[74]
What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning , author=. arXiv preprint arXiv:2312.15685 , year=
-
[75]
arXiv preprint arXiv:2308.12032 , year=
From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning , author=. arXiv preprint arXiv:2308.12032 , year=
-
[76]
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Do NOT Think That Much for 2+ 3=? On the Overthinking of o1-Like LLMs , author=. arXiv preprint arXiv:2412.21187 , year=
work page internal anchor Pith review arXiv
-
[77]
Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =. 2024 , url =
work page 2024
-
[78]
International Conference on Learning Representations , year=
Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=
- [79]
- [80]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.