Pith · machine review for the scientific record

arxiv: 2503.18892 · v3 · submitted 2025-03-24 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Junxian He, Keqing He, Qian Liu, Weihao Zeng, Wei Liu, Yuzhen Huang, Zejun Ma

Pith reviewed 2026-05-13 08:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords zero RL · reinforcement learning · base models · chain-of-thought · aha moment · reasoning accuracy · open models · query difficulty

The pith

Zero RL training succeeds on diverse base models when format rewards and query difficulty are tuned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether zero RL, a simple reinforcement learning setup using only rule-based rewards, can produce long chain-of-thought reasoning when started directly from base models outside the Qwen family. Experiments cover ten models, from Llama3-8B and Mistral-7B/24B to DeepSeek-Math-7B and all Qwen2.5 sizes from 0.5B to 32B. By adjusting the format reward and controlling query difficulty, the authors obtain clear gains in reasoning accuracy and longer responses in most settings. Training dynamics differ across families, yet verification behaviors labeled the 'aha moment' appear for the first time in small non-Qwen models. The work documents the exact design choices and releases code plus models so others can repeat the process.

Core claim

Zero RL training, which applies rule-based rewards directly to base models without prior supervised fine-tuning, yields substantial gains in reasoning accuracy and response length when format rewards are adjusted and query difficulty is controlled. Distinct training patterns appear across model families, but the emergence of verification behaviors known as the aha moment is documented for the first time in small models outside the Qwen series.

What carries the argument

The zero RL procedure itself: rule-based rewards applied directly to the base model, combined with format-reward adjustment and query-difficulty control.
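
A minimal sketch of that machinery, offered as an editorial illustration: the boxed-answer convention, the regex, and the default weight below are assumptions, not the authors' released implementation.

    import re

    # Hypothetical rule-based reward with an adjustable format term.
    # Boxed-answer convention, regex, and default weight are assumptions.
    BOXED_RE = re.compile(r"\\boxed\{([^{}]+)\}")

    def format_reward(response: str) -> float:
        """1.0 if the response contains a parsable \\boxed{...} answer, else 0.0."""
        return 1.0 if BOXED_RE.search(response) else 0.0

    def correctness_reward(response: str, gold_answer: str) -> float:
        """1.0 if the extracted boxed answer matches the reference string, else 0.0."""
        match = BOXED_RE.search(response)
        if match is None:
            return 0.0
        return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

    def zero_rl_reward(response: str, gold_answer: str,
                       format_weight: float = 0.5) -> float:
        """Rule-based reward: correctness plus a weighted format term (the tunable knob)."""
        return (correctness_reward(response, gold_answer)
                + format_weight * format_reward(response))

    # A well-formatted but wrong answer still earns the format term.
    print(zero_rl_reward("... so the answer is \\boxed{42}", "41"))  # 0.5

Raising or lowering format_weight is one reading of the "format reward adjustment" knob; how the paper actually sets it is left to the released code.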

If this is right

  • Reasoning accuracy rises across most of the ten tested base models.
  • Average response lengths increase, indicating longer chain-of-thought sequences.
  • Verification behaviors emerge in small models from non-Qwen families.
  • Training curves vary by model family, so monitoring must be family-specific.
  • The released code and models make the same training setup replicable on new base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same tweaks may allow zero RL to succeed on models smaller than 7B if difficulty is scaled appropriately.
  • The method could extend to tasks outside math reasoning such as code generation if rule-based rewards are redefined.
  • Base models with weaker initial instruction following may require even stronger format rewards to avoid early collapse.
  • Zero RL now appears viable for any sufficiently open base model, reducing reliance on proprietary fine-tuned starting points.

Load-bearing premise

The observed accuracy gains, longer responses, and aha moments arise primarily from the zero RL procedure and the two design tweaks rather than from model-specific starting abilities or random training variation.
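
The second of those tweaks, query-difficulty control, can be read as filtering training prompts by how often the base model already solves them. The sketch below is one plausible reading under that assumption; the pass-rate estimator, the substring answer check, and the thresholds are all hypothetical.

    from typing import Callable, Iterable

    def estimate_pass_rate(prompt: str, gold: str,
                           sample: Callable[[str], str], n: int = 8) -> float:
        """Fraction of n base-model samples whose output contains the reference answer
        (a deliberately crude check; a real pipeline would parse the final answer)."""
        hits = sum(1 for _ in range(n) if gold in sample(prompt))
        return hits / n

    def filter_by_difficulty(dataset: Iterable[tuple[str, str]],
                             sample: Callable[[str], str],
                             low: float = 0.1, high: float = 0.9) -> list[tuple[str, str]]:
        """Keep prompts whose estimated pass rate lies in [low, high]: hard enough to
        leave room for improvement, easy enough that correct rollouts still occur."""
        return [(prompt, gold) for prompt, gold in dataset
                if low <= estimate_pass_rate(prompt, gold, sample) <= high]

Sweeping the thresholds, or choosing different prompt pools per model family, is one way to picture "controlling query difficulty" for weaker versus stronger base models.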

What would settle it

Applying the identical zero RL procedure to Llama3-8B or Mistral-7B without the format reward adjustment or query difficulty control and finding no increase in reasoning accuracy or response length.

original abstract

DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically investigates zero RL training starting from base models across 10 diverse checkpoints (Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and Qwen2.5 models from 0.5B to 32B). It claims that two design interventions—adjusting the format reward weight and controlling query difficulty—produce substantial gains in reasoning accuracy and response length, while revealing model-family-specific training dynamics; notably, the 'aha moment' (emergence of verification behaviors) is reported for the first time in small non-Qwen models. The work releases code, models, and analysis tools.

Significance. If the reported improvements and dynamics are robustly attributable to the stated interventions, the study meaningfully broadens zero RL beyond the Qwen family, supplies practical guidelines for other open base models, and provides reproducible artifacts that lower the barrier for follow-up work on RL-induced reasoning.

major comments (2)
  1. [Experiments] Experiments section: the central attribution of accuracy and length gains to format-reward adjustment and query-difficulty control is not isolated from base-model priors or implicit data filtering. No ablation is described that holds data distribution, initialization, and random seed fixed while toggling only those two knobs; without such controls the observed differences could reflect pre-existing instruction-following or self-reflection capabilities rather than the zero-RL process itself.
  2. [Training Dynamics] Training-dynamics analysis: the claim that increased response length does not always correlate with verification behaviors (the 'aha moment') is presented as a key finding, yet the operational definition and quantitative threshold used to detect the 'aha moment' across model families are not specified, making cross-model comparisons difficult to reproduce or falsify.
minor comments (2)
  1. [Abstract] Abstract and results tables: key quantitative outcomes (accuracy deltas, length statistics, error bars) are summarized qualitatively; adding a compact table of main metrics with confidence intervals would strengthen the empirical claims.
  2. [Method] Notation: the precise functional form of the format reward (its weight, scaling, and interaction with the rule-based reward) is referenced but not written explicitly; an equation or pseudocode block would remove ambiguity.
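
As an editorial illustration only, one candidate functional form that would answer this comment (not the paper's actual definition):

    $R(y \mid x) = r_{\mathrm{correct}}(y, x) + \lambda \, r_{\mathrm{format}}(y), \qquad r_{\mathrm{correct}},\, r_{\mathrm{format}} \in \{0, 1\},$

where $\lambda$ is the adjustable format-reward weight and both component rewards are computed by rules rather than a learned model.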

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important areas for improving clarity and rigor in our experimental design and analysis. We address each point below and will incorporate revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the central attribution of accuracy and length gains to format-reward adjustment and query-difficulty control is not isolated from base-model priors or implicit data filtering. No ablation is described that holds data distribution, initialization, and random seed fixed while toggling only those two knobs; without such controls the observed differences could reflect pre-existing instruction-following or self-reflection capabilities rather than the zero-RL process itself.

    Authors: We agree that a more tightly controlled ablation would strengthen causal attribution. Our current experiments apply the interventions consistently across 10 diverse base models while holding the core RL setup (optimizer, batch size, reward structure except for the format weight, and training duration) fixed, and we observe gains primarily when the interventions are active. However, we did not include an explicit ablation that freezes data distribution, initialization, and seed while toggling only format-reward weight and query difficulty. In the revision we will add this controlled ablation on at least two representative models (one Qwen and one non-Qwen) to isolate the contribution of each knob. revision: yes

  2. Referee: [Training Dynamics] Training-dynamics analysis: the claim that increased response length does not always correlate with verification behaviors (the 'aha moment') is presented as a key finding, yet the operational definition and quantitative threshold used to detect the 'aha moment' across model families are not specified, making cross-model comparisons difficult to reproduce or falsify.

    Authors: We acknowledge that the current manuscript lacks an explicit operational definition and detection threshold for the 'aha moment'. In the revision we will add a dedicated subsection that defines verification behavior quantitatively (e.g., the first occurrence of explicit self-checking phrases or answer-verification steps after a minimum response length, with the precise regex and length threshold used) and report the detection criteria uniformly across all model families. This will make the cross-model comparison reproducible. revision: yes
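
A minimal sketch of what such a detector might look like under the definition the rebuttal proposes; the phrase list, the regex, the length threshold, and the 5% fraction are placeholders, not the authors' actual revision.

    import re

    # Hypothetical verification-behavior ('aha moment') detector: explicit
    # self-checking phrases in a sufficiently long response. Illustrative only.
    VERIFICATION_PATTERNS = re.compile(
        r"(let me (re-?)?check|let's verify|wait,|double-?check|re-?examine|on second thought)",
        re.IGNORECASE,
    )
    MIN_RESPONSE_TOKENS = 256  # placeholder length threshold

    def shows_verification(response: str) -> bool:
        """True if the response is long enough and contains a self-checking phrase."""
        if len(response.split()) < MIN_RESPONSE_TOKENS:
            return False
        return VERIFICATION_PATTERNS.search(response) is not None

    def aha_moment_step(responses_per_step: list[list[str]],
                        min_fraction: float = 0.05) -> int | None:
        """First training step at which at least `min_fraction` of sampled responses
        show verification behavior; None if it never happens."""
        for step, responses in enumerate(responses_per_step):
            if responses and sum(map(shows_verification, responses)) / len(responses) >= min_fraction:
                return step
        return None

Reporting the detected step and fraction per model family, with the same phrase list everywhere, would make the cross-family comparison reproducible and falsifiable.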

Circularity Check

0 steps flagged

No circularity in empirical observations

full rationale

The paper is a purely empirical study reporting training runs of zero RL on 10 base models, with observations on accuracy, response length, and emergence of behaviors like verification. No equations, derivations, or first-principles predictions are present that could reduce to fitted inputs or self-citations by construction. All results are direct measurements from experiments, with design choices (format reward, query difficulty) described as interventions whose effects are measured rather than defined circularly. The conclusions about training dynamics rest on the work's own measurements rather than on external benchmarks, so no hidden circular dependence arises.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on empirical tuning rather than new theory. The work assumes rule-based rewards can elicit reasoning and that query difficulty can be controlled to stabilize training.

free parameters (1)
  • format reward weight
    Adjusted during experiments to enable stable training and longer responses across models.
axioms (1)
  • domain assumption: Rule-based rewards on format and correctness suffice to drive reasoning emergence in base models
    Inherited from the DeepSeek-R1 observation referenced in the abstract.

pith-pipeline@v0.9.0 · 5607 in / 1321 out tokens · 58754 ms · 2026-05-13T08:15:54.584166+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/ArrowOfTime arrow_from_z echoes

    Notably, we observe the 'aha moment' for the first time in small models not from the Qwen family... by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  2. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  3. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  4. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  5. Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.

  6. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  7. BoostLoRA: Growing Effective Rank by Boosting Adapters

    cs.LG 2026-04 unverdicted novelty 7.0

    BoostLoRA grows effective adapter rank linearly via iterative boosting on hard examples with orthogonal low-rank updates, outperforming both single-shot ultra-low-rank adapters and full fine-tuning on math and code ta...

  8. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

    cs.LG 2026-04 unverdicted novelty 7.0

    NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

  9. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  10. SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

    cs.AI 2026-04 unverdicted novelty 7.0

    SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

  11. Hölder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  12. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  13. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  14. Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...

  15. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  16. WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 6.0

    WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...

  17. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  18. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  19. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

  20. When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.

  21. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  22. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  23. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  24. Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

    cs.LG 2026-05 unverdicted novelty 4.0

    Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.