Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

2025 · cs.LG · arXiv 2509.25300

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on a set of experiments across the full Qwen2.5 dense model series (0.5B to 72B), we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings:

1. Larger models consistently exhibit superior learning efficiency on both compute and data metrics.
2. The relationship between test loss, compute, and data can be modeled by a predictive power law that is robust across both base and instruction-tuned models.
3. Although larger models exhibit higher learning efficiency, the analytical learning efficiency term k(N) in the power law reveals a latent saturation trend in learning efficiency as model size continues to increase.
4. In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples.

Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.
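The abstract states that test loss follows a predictive power law in compute and data but does not give the functional form. As an illustration only, the sketch below fits a generic single-variable power law, loss ≈ A · compute^(−k), by ordinary least squares in log-log space. The form, the function name `fit_power_law`, and the data points are all assumptions made for this example; they are not the paper's formula or results.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ~= A * compute**(-k) by least squares on log-log values.

    Hypothetical helper for illustration; the paper's actual fitting
    procedure is not described in the abstract.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form slope and intercept of a simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In log space: log(loss) = log(A) - k * log(compute).
    return -slope, math.exp(intercept)


# Synthetic, made-up data generated from a known power law (A=2.0, k=0.05).
compute = [1e15, 1e16, 1e17, 1e18]
loss = [2.0 * c ** -0.05 for c in compute]

k, A = fit_power_law(compute, loss)
print(f"k = {k:.4f}, A = {A:.4f}")
```

Under this assumed form, a larger fitted exponent k means loss falls faster per unit of added compute. Fitting one exponent per model size and comparing them is one way to read a size-dependent efficiency term such as the abstract's k(N).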

citation-role summary

background 2

citation-polarity summary

background 1 unclear 1

representative citing papers

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.

Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

Argues for shifting to diagnosis-driven tension management of offline priors in online RL, supported by a framework on prior roles, experiments showing help-or-hurt reversals, and cross-domain evidence.

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

cs.AI · 2026-04-20 · unverdicted · novelty 5.0

Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.

Continued AI Scaling Requires Repeated Efficiency Doublings

cs.LG · 2026-03-30 · unverdicted · novelty 3.0

Continued AI scaling remains feasible only if efficiency doublings recur repeatedly to keep logical compute affordable.

citing papers explorer

Showing 7 of 7 citing papers.

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL · cs.LG · 2026-07-01 · unverdicted · none · ref 63 · internal anchor
Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors · cs.LG · 2026-06-24 · unverdicted · none · ref 65 · internal anchor
Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training · cs.LG · 2026-05-10 · unverdicted · none · ref 7 · internal anchor
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key · cs.AI · 2026-05-07 · unverdicted · none · ref 94 · 3 links · internal anchor
Agentic Frameworks for Reasoning Tasks: An Empirical Study · cs.AI · 2026-04-17 · unverdicted · none · ref 64 · internal anchor
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes · cs.AI · 2026-04-20 · unverdicted · none · ref 13 · internal anchor
Continued AI Scaling Requires Repeated Efficiency Doublings · cs.LG · 2026-03-30 · unverdicted · none · ref 15 · internal anchor
