pith. machine review for the scientific record.

arxiv: 2501.00656 · v3 · submitted 2024-12-31 · 💻 cs.CL · cs.LG

Recognition: 3 Lean theorem links

2 OLMo 2 Furious

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 16:42 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords open language models · pretraining data mixture · curriculum training · model architecture · training stability · reinforcement learning · performance efficiency · model transparency

The pith

OLMo 2 models match or exceed comparable open models on benchmark performance while using fewer training FLOPs and releasing all training artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OLMo 2 as the next generation of fully open dense autoregressive language models at 7B, 13B, and 32B scales. The authors apply architecture modifications and training adjustments to increase stability and per-token efficiency during pretraining. They introduce the Dolmino Mix 1124 data mixture for late-stage curriculum training in the annealing phase, which raises scores across many downstream tasks. Base models are positioned at the Pareto frontier of performance versus training compute, often matching or beating models such as Llama 3.1, Qwen 2.5, and Gemma 2 with lower FLOPs and complete transparency in weights, data, code, logs, and checkpoints. Instruct versions built with permissive data and reinforcement learning with verifiable rewards compete with both open models and certain proprietary systems.

Core claim

By combining a revised model architecture for training stability, an updated pretraining recipe, and the specialized Dolmino Mix 1124 introduced via late-stage curriculum training, the OLMo 2 base models sit at the Pareto frontier of performance to training compute. They often match or outperform open-weight models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs, with all training data, code, and recipes released openly. The OLMo 2-Instruct models, developed by extending Tulu 3 practices and applying final-stage reinforcement learning with verifiable rewards, remain competitive with open models of similar size and some proprietary models such as GPT-3.5 Turbo and GPT-4o Mini.
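
As context for the RLVR step: a verifiable reward is a programmatic check rather than a learned preference model. The sketch below is a minimal Python illustration under assumed conventions (an "Answer:" extraction rule and exact-match scoring); it is not the authors' implementation.

    # Minimal sketch of a verifiable reward for RLVR-style post-training.
    # The "Answer:" extraction rule and exact-match scoring are assumptions
    # made for illustration, not the paper's actual reward functions.
    import re
    def extract_final_answer(completion: str) -> str | None:
        """Pull the last 'Answer: ...' value out of a model completion."""
        matches = re.findall(r"Answer:\s*([^\n]+)", completion)
        return matches[-1].strip() if matches else None
    def verifiable_reward(completion: str, gold_answer: str) -> float:
        """Binary reward: 1.0 only if the extracted answer matches the reference."""
        predicted = extract_final_answer(completion)
        return 1.0 if predicted == gold_answer.strip() else 0.0
    print(verifiable_reward("Let x = 2, so 2x = 4. Answer: 4", "4"))  # -> 1.0

Because the reward comes from a check against a known answer, it cannot be gamed by exploiting a learned reward model, which is the property the paper leans on for the final alignment stage.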

What carries the argument

Late-stage curriculum training with the Dolmino Mix 1124 specialized data mixture, which concentrates targeted data in the annealing phase to raise downstream benchmark performance and per-token efficiency.
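
Operationally, "specialized data during the annealing phase" can be pictured as a switch in the sampling mixture over the final stretch of training while the learning rate decays to zero. The sketch below uses invented source names, mix proportions, and an assumed 10% anneal fraction; it is a schematic, not the released OLMo 2 recipe.

    # Schematic of a late-stage data curriculum: the usual pretraining mix for
    # most of training, then a specialized mix during the final "annealing"
    # steps while the learning rate decays linearly to zero. All names and
    # proportions here are illustrative assumptions.
    import random
    PRETRAIN_MIX = {"web": 0.70, "code": 0.15, "papers": 0.15}
    ANNEAL_MIX = {"web": 0.30, "math": 0.35, "instructions": 0.35}  # Dolmino-style stand-in
    ANNEAL_FRACTION = 0.10  # assumed: final 10% of steps use the specialized mix
    def mixture_for_step(step: int, total_steps: int) -> dict[str, float]:
        """Return the data-source sampling weights active at a given training step."""
        anneal_start = (1.0 - ANNEAL_FRACTION) * total_steps
        return ANNEAL_MIX if step >= anneal_start else PRETRAIN_MIX
    def lr_for_step(step: int, total_steps: int, peak_lr: float = 3e-4) -> float:
        """Hold the peak learning rate, then decay linearly to zero during annealing."""
        anneal_start = (1.0 - ANNEAL_FRACTION) * total_steps
        if step < anneal_start:
            return peak_lr
        return peak_lr * max(0.0, 1.0 - (step - anneal_start) / (total_steps - anneal_start))
    def sample_source(step: int, total_steps: int) -> str:
        """Draw one data source according to the mixture active at this step."""
        weights = mixture_for_step(step, total_steps)
        return random.choices(list(weights), weights=list(weights.values()))[0]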

If this is right

  • Complete public release of weights, data, code, logs, and intermediate checkpoints enables full reproduction and incremental research by any group.
  • Efficiency improvements allow comparable or better model quality at reduced training compute cost.
  • The curriculum approach with specialized late-stage data can be applied to other model scales to lift task performance without proportional increases in total training compute.
  • Verifiable-reward reinforcement learning supports more reliable final alignment in open instruct models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption of full artifact releases could raise the standard for verifiable claims in language-model research.
  • Targeted data curricula may become a standard lever for efficient capability gains once their effects are confirmed across additional domains.
  • Open training data allows external groups to study and refine data selection practices that affect model behavior.

Load-bearing premise

That the chosen downstream benchmarks accurately capture the benefits of the architecture changes and Dolmino Mix 1124, and that FLOPs comparisons to other models are fair and comprehensive.
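
The fairness of those FLOPs comparisons usually rests on the standard 6ND approximation (roughly six FLOPs per parameter per training token). The sketch below shows the arithmetic an apples-to-apples comparison would apply uniformly; the parameter and token counts are placeholders, not figures from the paper or from the comparison models.

    # Training-compute comparison under the common 6*N*D approximation:
    # total FLOPs ~= 6 * parameters * training tokens. The model entries
    # below are placeholders, not the paper's reported numbers.
    def training_flops(n_params: float, n_tokens: float) -> float:
        """Approximate total training FLOPs for a dense transformer."""
        return 6.0 * n_params * n_tokens
    models = {
        "hypothetical_7b_on_4T_tokens": (7e9, 4e12),
        "hypothetical_8b_on_15T_tokens": (8e9, 15e12),
    }
    for name, (n, d) in models.items():
        print(f"{name}: {training_flops(n, d):.2e} training FLOPs")

Any per-model corrections (attention variants, embedding handling, exact token counts) have to be applied consistently to every entry for the Pareto-frontier claim to hold up.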

What would settle it

Independent runs on a new, broad benchmark suite not used in the paper that show OLMo 2 models falling short of the claimed performance levels relative to the listed comparators, or an external audit that finds higher actual training FLOPs than reported for the achieved results.

read the original abstract

We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents OLMo 2, a family of fully open dense autoregressive language models at 7B, 13B, and 32B scales, with complete release of weights, training data, code, recipes, logs, and intermediate checkpoints. It describes modifications to architecture and training recipe aimed at stability and per-token efficiency, introduces Dolmino Mix 1124 for late-stage curriculum (annealing) pretraining to improve downstream capabilities, and applies Tulu 3 practices plus RLVR to create OLMo 2-Instruct models. The central claim is that the base OLMo 2 models occupy the Pareto frontier of performance versus training compute, matching or exceeding open-weight models such as Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs, with the instruct variants competitive with some proprietary systems.

Significance. If the performance and efficiency claims are substantiated, the work makes a substantial contribution by advancing fully transparent, high-performing open models and providing extensive artifacts that enable independent verification and follow-on research. The emphasis on training stability techniques and targeted late-stage data curricula offers practical, reproducible insights for efficient pretraining at scale. Full release of thousands of checkpoints and logs is a notable strength that distinguishes this from typical closed or partially open releases.

major comments (3)
  1. [§6] §6 (Experimental Results, Pareto frontier comparisons): The central claim that OLMo 2 models achieve superior or matched performance at lower training compute requires uniform, architecture-aware FLOPs accounting (e.g., consistent application of the 6ND approximation, exact token counts, and adjustments for any differences in attention or other components). The manuscript does not provide the detailed breakdown or verification steps for these calculations against Llama 3.1, Qwen 2.5, and Gemma 2, leaving the 'fewer FLOPs' assertion difficult to confirm independently.
  2. [§5] §5 (Dolmino Mix 1124 and late-stage curriculum): The performance gains on downstream benchmarks are attributed to the new data mix introduced during annealing, yet no ablation studies isolate its contribution from concurrent architecture changes or stability hyperparameter adjustments. Without such controls, the attribution required to support the efficiency and capability claims remains insecure.
  3. [§6.1] §6.1 (Benchmark evaluation protocols): The reported improvements on standard benchmarks (MMLU and others) lack explicit details on evaluation harness versions, prompt formatting, number of runs, or any post-hoc selection criteria. This information is load-bearing for verifying that the observed gains genuinely reflect the training innovations rather than evaluation artifacts.
minor comments (3)
  1. [Table 1] Table 1 (model specifications): Clarify whether the reported parameter counts include embedding layers and whether FLOPs estimates account for the full forward pass during training.
  2. [Figure 3] Figure 3 (training curves): The y-axis scaling and legend for loss vs. tokens across model sizes could be made more precise to facilitate direct comparison of stability improvements.
  3. [Related Work] Related work section: Add citations to recent work on curriculum learning in large-scale pretraining and verifiable reward RL to better contextualize the Dolmino Mix and RLVR contributions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript. We have carefully considered each major comment and prepared point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns and improve clarity and verifiability.

read point-by-point responses
  1. Referee: [§6] §6 (Experimental Results, Pareto frontier comparisons): The central claim that OLMo 2 models achieve superior or matched performance at lower training compute requires uniform, architecture-aware FLOPs accounting (e.g., consistent application of the 6ND approximation, exact token counts, and adjustments for any differences in attention or other components). The manuscript does not provide the detailed breakdown or verification steps for these calculations against Llama 3.1, Qwen 2.5, and Gemma 2, leaving the 'fewer FLOPs' assertion difficult to confirm independently.

    Authors: We agree that explicit, uniform FLOPs accounting is necessary to substantiate the Pareto frontier claims. In the revised manuscript we have added a new appendix (Appendix C) that provides a complete breakdown for OLMo 2 and all comparison models. This includes: (1) exact training token counts for each model, (2) consistent application of the 6ND approximation, (3) adjustments for architectural differences such as attention head counts and MLP ratios, and (4) step-by-step verification of the resulting FLOP totals. These calculations confirm that the OLMo 2 models achieve the reported performance levels with fewer total training FLOPs than the cited open-weight baselines. revision: yes

  2. Referee: [§5] §5 (Dolmino Mix 1124 and late-stage curriculum): The performance gains on downstream benchmarks are attributed to the new data mix introduced during annealing, yet no ablation studies isolate its contribution from concurrent architecture changes or stability hyperparameter adjustments. Without such controls, the attribution required to support the efficiency and capability claims remains insecure.

    Authors: The referee is correct that the manuscript does not contain isolated ablation studies separating the Dolmino Mix 1124 from concurrent architecture and stability changes. Our development followed an incremental process in which these elements were refined together under compute constraints that precluded exhaustive ablations. In the revision we have expanded §5 with a more granular description of the Dolmino Mix composition, its curation rationale, and the specific downstream capabilities it targets during annealing. We also include references to internal checkpoint comparisons that show performance jumps coinciding with the introduction of the new mix. While we acknowledge that dedicated ablations would provide stronger causal evidence, the available evidence supports the contribution of the late-stage curriculum. revision: partial

  3. Referee: [§6.1] §6.1 (Benchmark evaluation protocols): The reported improvements on standard benchmarks (MMLU and others) lack explicit details on evaluation harness versions, prompt formatting, number of runs, or any post-hoc selection criteria. This information is load-bearing for verifying that the observed gains genuinely reflect the training innovations rather than evaluation artifacts.

    Authors: We thank the referee for identifying this omission. In the revised §6.1 we have added a dedicated evaluation protocol subsection that specifies: the exact version of the evaluation harness, the full prompt templates and formatting used for each benchmark, the number of independent runs performed with different random seeds, and explicit confirmation that no post-hoc model selection or cherry-picking occurred. All models were evaluated under identical conditions to ensure the reported gains reflect training differences. revision: yes
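
The protocol the authors describe in that final response could be pinned down as a small declared configuration plus a seed-averaged evaluation loop; the sketch below is one way to do it. The field names, prompt template, and seed count are illustrative assumptions, and evaluate_once is a placeholder rather than the authors' released harness.

    # Hypothetical sketch of a fixed evaluation protocol: one declared config,
    # several seeds, and a plain mean over runs with no post-hoc selection.
    from dataclasses import dataclass
    from statistics import mean
    @dataclass(frozen=True)
    class EvalProtocol:
        benchmark: str
        prompt_template: str
        n_shots: int
        seeds: tuple[int, ...]
    def evaluate_once(model, protocol: EvalProtocol, seed: int) -> float:
        """Placeholder for a single harness run returning an accuracy in [0, 1]."""
        raise NotImplementedError("plug in the actual evaluation harness here")
    def evaluate(model, protocol: EvalProtocol) -> float:
        """Average over all declared seeds so no single run is cherry-picked."""
        return mean(evaluate_once(model, protocol, seed) for seed in protocol.seeds)
    mmlu_protocol = EvalProtocol(
        benchmark="mmlu",
        prompt_template="Question: {question}\nChoices: {choices}\nAnswer:",
        n_shots=5,
        seeds=(0, 1, 2),
    )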

Circularity Check

0 steps flagged

No significant circularity; performance claims rely on external benchmarks and released artifacts

full rationale

The paper describes OLMo 2 architecture modifications, Dolmino Mix 1124 late-stage curriculum, and training stability techniques, then reports empirical results on downstream benchmarks. The central Pareto-frontier claim compares measured performance and training FLOPs against independently trained models (Llama 3.1, Qwen 2.5, Gemma 2). No equations, fitted parameters, or derivations are presented as predictions that reduce to the inputs by construction. A reference to Tulu 3 practices exists but is not load-bearing for the base-model performance assertions. The work is self-contained against external benchmarks and released artifacts.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Claims rest on empirical benchmark results and standard LLM training assumptions; the new data mix and hyperparameters are tuned elements without independent derivation.

free parameters (2)
  • Dolmino Mix 1124 composition and timing
    New specialized data mixture introduced during the annealing phase, with details on proportions and schedule fitted to improve benchmarks.
  • Training stability hyperparameters
    Modified recipe parameters chosen to achieve better stability and efficiency.
axioms (2)
  • standard math: Next-token prediction is the appropriate core objective for language modeling
    Implicit in the autoregressive dense model setup (a minimal loss sketch follows below).
  • domain assumption: Downstream task benchmarks reliably indicate general model improvements
    Used to validate the impact of the new data mix and recipe.
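
For readers who want the "standard math" axiom spelled out, the sketch below gives the usual next-token cross-entropy objective in PyTorch; the tensor shapes are illustrative and nothing in it is specific to OLMo 2.

    # The standard autoregressive objective: cross-entropy between the model's
    # logits and the same token sequence shifted by one position.
    import torch
    import torch.nn.functional as F
    def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """logits: (batch, seq, vocab) model outputs; tokens: (batch, seq) input ids."""
        shift_logits = logits[:, :-1, :]   # predict token t+1 from the prefix ending at t
        shift_targets = tokens[:, 1:]
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_targets.reshape(-1),
        )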

pith-pipeline@v0.9.0 · 5739 in / 1429 out tokens · 63362 ms · 2026-05-11T16:42:19.421329+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Characterizing the Expressivity of Local Attention in Transformers

    cs.CL 2026-05 unverdicted novelty 8.0

    Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...

  2. Demystifying the Silence of Correctness Bugs in PyTorch Compiler

    cs.SE 2026-04 conditional novelty 8.0

    First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.

  3. How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

    cs.LG 2026-05 unverdicted novelty 7.0

    The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...

  4. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  5. Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

  6. Implicit Representations of Grammaticality in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.

  7. The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

    cs.CY 2026-05 unverdicted novelty 7.0

    Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than ...

  8. Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers

    cs.LG 2026-04 unverdicted novelty 7.0

    In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.

  9. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  10. Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.

  11. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  12. Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines

    cs.LG 2026-05 unverdicted novelty 6.0

    An end-to-end energy measurement framework for LLM distillation pipelines reveals hidden teacher-side costs and yields selection guidelines plus an open-source harness.

  13. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  14. A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.

  15. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  16. SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

    cs.LG 2026-05 unverdicted novelty 6.0

    SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.

  17. Learning Rate Transfer in Normalized Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.

  18. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  19. Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...

  20. The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

    cs.LG 2026-04 unverdicted novelty 6.0

    Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...

  21. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

    q-bio.NC 2026-04 unverdicted novelty 6.0

    OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

  22. Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.

  23. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  24. Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance

    cs.CL 2026-04 unverdicted novelty 6.0

    The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.

  25. Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

    cs.LG 2026-04 unverdicted novelty 6.0

    RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...

  26. Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in ex...

  27. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  28. Exclusive Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.

  29. AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

    cs.CV 2026-03 conditional novelty 6.0

    AD-Copilot trains an MLLM on a new curated industrial dataset Chat-AD with a Comparison Encoder that uses cross-attention on image pairs, reaching 82.3% accuracy on MMAD and 3.35x gains on MMAD-BBox while generalizing...

  30. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  31. LLMs Get Lost In Multi-Turn Conversation

    cs.CL 2025-05 unverdicted novelty 6.0

    LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

  32. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  33. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  34. GiVA: Gradient-Informed Bases for Vector-Based Adaptation

    cs.CL 2026-04 unverdicted novelty 5.0

    GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.

  35. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  36. Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

    cs.LG 2026-04 unverdicted novelty 5.0

    Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.

  37. (How) Learning Rates Regulate Catastrophic Overtraining

    cs.LG 2026-04 unverdicted novelty 5.0

    Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.

  38. Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

    cs.CL 2026-03 unverdicted novelty 5.0

    Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.

  39. Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    cs.LG 2026-05 unverdicted novelty 4.0

    Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.