2 OLMo 2 Furious
Pith reviewed 2026-05-11 16:42 UTC · model grok-4.3
The pith
OLMo 2 models match or exceed comparable open-weight models on downstream benchmarks while using fewer training FLOPs and releasing all training artifacts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining a revised model architecture for training stability, an updated pretraining recipe, and the specialized Dolmino Mix 1124 introduced via late-stage curriculum training, the OLMo 2 base models sit at the Pareto frontier of performance versus training compute. They often match or outperform open-weight models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs, with all training data, code, and recipes released openly. The OLMo 2-Instruct models, developed by extending Tülu 3 practices and applying final-stage reinforcement learning with verifiable rewards, remain competitive with open models of similar size and some proprietary models such as GPT-3.5 Turbo and GPT-4o Mini.
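To make the Pareto-frontier claim concrete, the following is a minimal sketch of how one might check which models lie on the frontier of benchmark score versus training FLOPs. The model names, FLOP totals, and scores are made-up placeholders, not figures from the paper, and `pareto_frontier` is a hypothetical helper rather than anything the authors released.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    train_flops: float  # total training compute (lower is better)
    score: float        # average benchmark score (higher is better)


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return the models that no other model dominates.

    A model is dominated when some other model uses no more FLOPs and
    scores at least as high, with at least one of the two strictly better.
    """
    frontier = []
    for m in models:
        dominated = any(
            o.train_flops <= m.train_flops
            and o.score >= m.score
            and (o.train_flops < m.train_flops or o.score > m.score)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.train_flops)


# Placeholder values for illustration only (NOT taken from the paper).
models = [
    Model("model-a", 1.0e23, 60.0),
    Model("model-b", 2.0e23, 58.0),  # more compute, lower score than model-a
    Model("model-c", 4.0e23, 70.0),
    Model("model-d", 0.5e23, 50.0),
]

frontier_names = [m.name for m in pareto_frontier(models)]
print(frontier_names)
```

Under this definition a model can sit on the frontier without being the best scorer overall; it only needs to be unbeaten at its own compute budget.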
What carries the argument
Late-stage curriculum training with the Dolmino Mix 1124 specialized data mixture, which concentrates targeted data in the annealing phase of pretraining to raise downstream benchmark performance and per-token efficiency.
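As a rough illustration of the idea, the sketch below switches from a general mixture to a specialized one for the final annealing steps while the learning rate decays. The mixture labels, the constant-then-linear schedule shape, and all step counts are assumptions made for this example; the paper's actual schedule and mixture proportions are not reproduced here.

```python
def stage_for_step(step: int, total_steps: int, anneal_steps: int) -> str:
    """Return which data mixture is sampled at a given optimizer step."""
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    anneal_start = total_steps - anneal_steps
    # Hypothetical labels: the specialized mix is only used while annealing.
    return "specialized_mix" if step >= anneal_start else "general_mix"


def learning_rate(step: int, total_steps: int, anneal_steps: int, base_lr: float) -> float:
    """Constant LR before annealing, then linear decay toward zero.

    The constant phase is a simplification; real recipes usually include
    warmup and a non-constant main schedule.
    """
    anneal_start = total_steps - anneal_steps
    if step < anneal_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / anneal_steps


# Example: 1000 total steps, the last 100 of which are the annealing phase.
print(stage_for_step(899, 1000, 100), learning_rate(899, 1000, 100, 1.0))
print(stage_for_step(950, 1000, 100), learning_rate(950, 1000, 100, 1.0))
```

The point of the sketch is only the coupling: the specialized data arrives exactly when the learning rate is being driven down.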
If this is right
- Complete public release of weights, data, code, logs, and intermediate checkpoints enables full reproduction and incremental research by any group.
- Efficiency improvements allow comparable or better model quality at reduced training compute cost.
- The curriculum approach with specialized late-stage data can be applied to other model scales to lift task performance without proportional increases in total training.
- Verifiable-reward reinforcement learning supports more reliable final alignment in open instruct models.
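For readers unfamiliar with verifiable rewards, here is a minimal sketch of the general idea: the reward is computed by a deterministic check against a known answer rather than by a learned reward model. The functions `extract_final_answer` and `verifiable_reward` are hypothetical names, and the "last number in the completion" rule is an assumption chosen for brevity, not the paper's verifier.

```python
import re
from typing import Optional

# Matches integers and simple decimals, optionally negative.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_answer(completion: str) -> Optional[str]:
    """Take the last number in the completion as the model's final answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference else 0.0


print(verifiable_reward("Adding the parts gives a total of 42.", "42"))
print(verifiable_reward("I believe the answer is 41", "42"))
```

A string comparison like this is deliberately strict; a practical verifier would likely normalize formats (for example "42.0" versus "42") before comparing.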
Where Pith is reading between the lines
- Widespread adoption of full artifact releases could raise the standard for verifiable claims in language-model research.
- Targeted data curricula may become a standard lever for efficient capability gains once their effects are confirmed across additional domains.
- Open training data allows external groups to study and refine data selection practices that affect model behavior.
Load-bearing premise
That the chosen downstream benchmarks accurately capture the benefits of the architecture changes and Dolmino Mix 1124, and that FLOPs comparisons to other models are fair and comprehensive.
What would settle it
Independent runs on a new, broad benchmark suite not used in the paper that show OLMo 2 models falling short of the claimed performance levels relative to the listed comparators, or an external audit that finds higher actual training FLOPs than reported for the achieved results.
Original abstract
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents OLMo 2, a family of fully open dense autoregressive language models at 7B, 13B, and 32B scales, with complete release of weights, training data, code, recipes, logs, and intermediate checkpoints. It describes modifications to architecture and training recipe aimed at stability and per-token efficiency, introduces Dolmino Mix 1124 for late-stage curriculum (annealing) pretraining to improve downstream capabilities, and applies Tülu 3 practices plus RLVR to create OLMo 2-Instruct models. The central claim is that the base OLMo 2 models occupy the Pareto frontier of performance versus training compute, matching or exceeding open-weight models such as Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs, with the instruct variants competitive with some proprietary systems.
Significance. If the performance and efficiency claims are substantiated, the work makes a substantial contribution by advancing fully transparent, high-performing open models and providing extensive artifacts that enable independent verification and follow-on research. The emphasis on training stability techniques and targeted late-stage data curricula offers practical, reproducible insights for efficient pretraining at scale. Full release of thousands of checkpoints and logs is a notable strength that distinguishes this from typical closed or partially open releases.
major comments (3)
- [§6] §6 (Experimental Results, Pareto frontier comparisons): The central claim that OLMo 2 models achieve superior or matched performance at lower training compute requires uniform, architecture-aware FLOPs accounting (e.g., consistent application of the 6ND approximation, exact token counts, and adjustments for any differences in attention or other components). The manuscript does not provide the detailed breakdown or verification steps for these calculations against Llama 3.1, Qwen 2.5, and Gemma 2, leaving the 'fewer FLOPs' assertion difficult to confirm independently.
- [§5] §5 (Dolmino Mix 1124 and late-stage curriculum): The performance gains on downstream benchmarks are attributed to the new data mix introduced during annealing, yet no ablation studies isolate its contribution from concurrent architecture changes or stability hyperparameter adjustments. Without such controls, the attribution required to support the efficiency and capability claims remains insecure.
- [§6.1] §6.1 (Benchmark evaluation protocols): The reported improvements on standard benchmarks (MMLU and others) lack explicit details on evaluation harness versions, prompt formatting, number of runs, or any post-hoc selection criteria. This information is load-bearing for verifying that the observed gains genuinely reflect the training innovations rather than evaluation artifacts.
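The 6ND approximation mentioned in the first major comment estimates training compute as roughly 6 FLOPs per parameter per token (about 2ND for the forward pass and 4ND for the backward pass). The sketch below applies it uniformly to two hypothetical models; the parameter and token counts are round placeholders chosen for illustration, not values reported for OLMo 2 or any comparator.

```python
def training_flops_6nd(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs as 6 * N * D.

    N is the parameter count and D the number of training tokens. The
    approximation ignores attention FLOPs that grow with sequence length,
    which is one reason architecture-aware adjustments may be needed.
    """
    if num_params < 0 or num_tokens < 0:
        raise ValueError("counts must be non-negative")
    return 6.0 * num_params * num_tokens


# Placeholder sizes for illustration only (NOT from the paper).
flops_small = training_flops_6nd(num_params=1e9, num_tokens=1e11)
flops_large = training_flops_6nd(num_params=2e9, num_tokens=3e11)
ratio = flops_large / flops_small

print(f"small: {flops_small:.2e} FLOPs")
print(f"large: {flops_large:.2e} FLOPs")
print(f"ratio: {ratio:.1f}x")
```

Whether N should include embedding parameters changes the totals, which is why the related minor comment on Table 1 matters for any cross-model comparison.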
minor comments (3)
- [Table 1] Table 1 (model specifications): Clarify whether the reported parameter counts include embedding layers and whether FLOPs estimates account for the full forward pass during training.
- [Figure 3] Figure 3 (training curves): The y-axis scaling and legend for loss vs. tokens across model sizes could be made more precise to facilitate direct comparison of stability improvements.
- [Related Work] Related work section: Add citations to recent work on curriculum learning in large-scale pretraining and verifiable reward RL to better contextualize the Dolmino Mix and RLVR contributions.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review of our manuscript. We have carefully considered each major comment and prepared point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns and improve clarity and verifiability.
Point-by-point responses
- Referee: [§6] §6 (Experimental Results, Pareto frontier comparisons): The central claim that OLMo 2 models achieve superior or matched performance at lower training compute requires uniform, architecture-aware FLOPs accounting (e.g., consistent application of the 6ND approximation, exact token counts, and adjustments for any differences in attention or other components). The manuscript does not provide the detailed breakdown or verification steps for these calculations against Llama 3.1, Qwen 2.5, and Gemma 2, leaving the 'fewer FLOPs' assertion difficult to confirm independently.
  Authors: We agree that explicit, uniform FLOPs accounting is necessary to substantiate the Pareto frontier claims. In the revised manuscript we have added a new appendix (Appendix C) that provides a complete breakdown for OLMo 2 and all comparison models. This includes: (1) exact training token counts for each model, (2) consistent application of the 6ND approximation, (3) adjustments for architectural differences such as attention head counts and MLP ratios, and (4) step-by-step verification of the resulting FLOP totals. These calculations confirm that the OLMo 2 models achieve the reported performance levels with fewer total training FLOPs than the cited open-weight baselines.
  Revision: yes
- Referee: [§5] §5 (Dolmino Mix 1124 and late-stage curriculum): The performance gains on downstream benchmarks are attributed to the new data mix introduced during annealing, yet no ablation studies isolate its contribution from concurrent architecture changes or stability hyperparameter adjustments. Without such controls, the attribution required to support the efficiency and capability claims remains insecure.
  Authors: The referee is correct that the manuscript does not contain isolated ablation studies separating the Dolmino Mix 1124 from concurrent architecture and stability changes. Our development followed an incremental process in which these elements were refined together under compute constraints that precluded exhaustive ablations. In the revision we have expanded §5 with a more granular description of the Dolmino Mix composition, its curation rationale, and the specific downstream capabilities it targets during annealing. We also include references to internal checkpoint comparisons that show performance jumps coinciding with the introduction of the new mix. While we acknowledge that dedicated ablations would provide stronger causal evidence, the available evidence supports the contribution of the late-stage curriculum.
  Revision: partial
- Referee: [§6.1] §6.1 (Benchmark evaluation protocols): The reported improvements on standard benchmarks (MMLU and others) lack explicit details on evaluation harness versions, prompt formatting, number of runs, or any post-hoc selection criteria. This information is load-bearing for verifying that the observed gains genuinely reflect the training innovations rather than evaluation artifacts.
  Authors: We thank the referee for identifying this omission. In the revised §6.1 we have added a dedicated evaluation protocol subsection that specifies: the exact version of the evaluation harness, the full prompt templates and formatting used for each benchmark, the number of independent runs performed with different random seeds, and explicit confirmation that no post-hoc model selection or cherry-picking occurred. All models were evaluated under identical conditions to ensure the reported gains reflect training differences.
  Revision: yes
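The last response refers to multiple independent runs with different random seeds. A minimal sketch of how such runs might be summarized is shown below; the scores are invented placeholders and `summarize_runs` is a hypothetical helper, not part of any evaluation harness named in the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores.

    With a single run there is no spread to estimate, so 0.0 is returned
    for the standard deviation.
    """
    if not scores:
        raise ValueError("need at least one score")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stdev


# Placeholder per-seed accuracies for one benchmark (NOT from the paper).
mean, stdev = summarize_runs([60.0, 62.0, 64.0])
print(f"accuracy: {mean:.1f} +/- {stdev:.1f} over 3 seeds")
```

Reporting the spread alongside the mean makes it easier to judge whether a gap between two models exceeds run-to-run noise.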
Circularity Check
No significant circularity; performance claims rely on external benchmarks and released artifacts
The paper describes OLMo 2 architecture modifications, Dolmino Mix 1124 late-stage curriculum, and training stability techniques, then reports empirical results on downstream benchmarks. The central Pareto-frontier claim compares measured performance and training FLOPs against independently trained models (Llama 3.1, Qwen 2.5, Gemma 2). No equations, fitted parameters, or derivations are presented as predictions that reduce to the inputs by construction. A reference to Tülu 3 practices exists but is not load-bearing for the base-model performance assertions. The work is self-contained against external benchmarks and released artifacts.
Axiom & Free-Parameter Ledger
free parameters (2)
- Dolmino Mix 1124 composition and timing
- Training stability hyperparameters
axioms (2)
- standard math: Next-token prediction is the appropriate core objective for language modeling
- domain assumption: Downstream task benchmarks reliably indicate general model improvements
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability, bilinear_family_forced (status: unclear)
  Linked passage: "Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs..."
- IndisputableMonolith.Foundation.PhiForcing, phi_fixed_point (status: unclear)
  Linked passage: "Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training"
- IndisputableMonolith.Foundation.DimensionForcing, alexander_duality_circle_linking (status: unclear)
  Linked passage: "We adopt a decoder-only transformer architecture... RMSNorm... QK-norm... Z-Loss... RoPE θ=5e5"