Recognition: 2 theorem links
Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
Pith reviewed 2026-05-10 19:36 UTC · model grok-4.3
The pith
GRPO training on lower-difficulty math problems achieves full-dataset accuracy with only about 45% of the training steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
As problem difficulty increases, GRPO-tuned SLM accuracy plateaus, revealing a capacity boundary where the method primarily reshapes output preferences without reliably improving performance on the hardest tier. Training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only about 45% of the training steps, and GSM8K-trained models outperform MATH-trained ones on the numeric subset of MATH by 3-5%. The best achievable gains depend on the base model's prior reasoning competence and the dataset's difficulty profile.
What carries the argument
Difficulty-stratified analysis combined with controlled subset training experiments on GRPO with LoRA, showing equivalence between partial and full data regimes.
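The stratified analysis itself is mechanically simple; a minimal sketch of computing per-tier accuracy from graded evaluation results (the tier labels and toy data here are hypothetical stand-ins, not the paper's code):

```python
from collections import defaultdict

def accuracy_by_tier(results):
    """Group graded results by difficulty tier and compute per-tier accuracy.

    `results` is a list of (tier, correct) pairs, where `tier` is a difficulty
    label (e.g. MATH level 1-5) and `correct` is a bool from exact-match grading.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tier, correct in results:
        totals[tier] += 1
        hits[tier] += int(correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}

# Toy results: accuracy falls as difficulty rises, consistent with a plateau
# on the hardest tiers.
results = [(1, True), (1, True), (2, True), (2, False),
           (3, False), (3, True), (4, False), (4, False)]
print(accuracy_by_tier(results))  # {1: 1.0, 2: 0.5, 3: 0.5, 4: 0.0}
```

Comparing these per-tier curves between the subset-trained and full-dataset-trained models is what supports the equivalence claim.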
If this is right
- GRPO does not expand the model's fundamental capacity to solve problems beyond a certain difficulty threshold.
- Subset training on lower-difficulty problems is sufficient to reach the peak performance achievable with the full set.
- Cross-dataset transfer from simpler datasets like GSM8K can exceed direct training on more varied ones like MATH for specific subsets.
- Improvement magnitude is constrained by the starting model's reasoning baseline and how hard the overall dataset is.
Where Pith is reading between the lines
- If true, training curricula for small models should de-emphasize the hardest examples to conserve compute.
- Similar diminishing returns might appear in other domains where preference optimization is applied to capacity-limited models.
- Evaluation protocols may need to focus more on moderate difficulty to avoid overestimating gains from hard samples.
Load-bearing premise
The accuracy plateaus on hard problems reflect true boundaries of the small models' reasoning capacity rather than being caused by the specific GRPO algorithm, how problems are labeled by difficulty, or the way performance is measured.
What would settle it
If retraining the same models with a different optimization technique, or with refined difficulty labels, yielded meaningful accuracy gains on the hardest problems, that would show the plateaus are not fundamental capacity limits.
Original abstract
Recent alignment work on Large Language Models (LLMs) suggests preference optimization can improve reasoning by shifting probability mass toward better solutions. We test this claim in a resource-constrained setting by applying GRPO with LoRA to SLMs (up to 3B) for math reasoning on GSM8K and MATH datasets with difficulty-stratified analyses. As problem difficulty increases, accuracy plateaus, revealing a capacity boundary: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving. Consistent with this, training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% training steps, indicating diminishing returns from harder samples in this regime. We also find a cross-dataset generalization effect: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO, exceeding it by ~5% at 1.5B and by ~3% at 3B. We show that the best achievable gains depend strongly on the base model's prior reasoning competence and the dataset's difficulty profile.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies GRPO with LoRA to SLMs (up to 3B parameters) on GSM8K and MATH for math reasoning, using difficulty-stratified analysis. It reports accuracy plateaus at higher difficulty levels, indicating a capacity boundary where GRPO reshapes preferences without reliably solving hardest-tier problems. Training on lower-difficulty subsets alone matches full-dataset accuracy across tiers at ~45% training steps, suggesting diminishing returns from hard samples. It also finds GSM8K-trained models outperform MATH-trained ones on numeric MATH subsets (by ~5% at 1.5B, ~3% at 3B) and that gains depend on base-model prior competence and dataset difficulty profile.
Significance. If the empirical trends hold after addressing potential confounds, the work offers practical guidance for efficient SLM alignment by showing that hard-sample inclusion can yield limited gains beyond a capacity threshold. It contributes to understanding interactions between preference optimization, model scale, and data difficulty in reasoning tasks, with potential to reduce compute in resource-constrained settings.
major comments (2)
- [Abstract and experimental results] The headline result that lower-difficulty training matches full-dataset accuracy at ~45% steps (Abstract) is load-bearing for the diminishing-returns claim, yet the manuscript provides no details on statistical tests, variance across runs, or controls for GRPO reward sparsity/exploration limits on hard items. This leaves open whether plateaus reflect capacity boundaries or optimization artifacts (e.g., noisier exact-match signals on MATH hard tier).
- [Results on cross-dataset generalization] The cross-dataset generalization claim (GSM8K-trained GRPO exceeding MATH-trained on numeric MATH by 3-5%) requires explicit controls for dataset overlap, numeric-subset definition, and base-model prior competence to be interpretable; without these, the effect size cannot be confidently attributed to training regime rather than confounding factors.
minor comments (2)
- [Methods] Define difficulty stratification criteria explicitly for both GSM8K and MATH (e.g., how tiers are binned and verified).
- [Figures and tables] Include error bars, run counts, and significance markers on all accuracy plots to support trend claims.
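The minor comment on stratification criteria can be made concrete. A minimal sketch of tier binning, assuming MATH's annotated levels 1-5 and, for GSM8K (which has no official levels), a step-count proxy; both binning rules are hypothetical stand-ins, not the paper's documented procedure:

```python
def math_tier(level: int) -> str:
    """Bin MATH's annotated difficulty levels 1-5 into three tiers
    (hypothetical cut points; the paper's exact binning may differ)."""
    if level <= 2:
        return "easy"
    return "medium" if level == 3 else "hard"

def gsm8k_tier(solution: str) -> str:
    """Proxy difficulty for GSM8K: bin by the number of non-empty
    reasoning lines in the reference solution."""
    steps = len([ln for ln in solution.splitlines() if ln.strip()])
    if steps <= 3:
        return "easy"
    return "medium" if steps <= 6 else "hard"

print(math_tier(2), math_tier(5))                 # easy hard
print(gsm8k_tier("step1\nstep2\nstep3\nstep4"))   # medium
```

Stating rules at this level of explicitness, plus a verification pass over the resulting bins, would address the comment.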
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work examining GRPO tuning limits for small language models on math reasoning tasks. The comments highlight areas where additional rigor and clarity will strengthen the manuscript, and we address each point below with planned revisions.
Point-by-point responses
-
Referee: [Abstract and experimental results] The headline result that lower-difficulty training matches full-dataset accuracy at ~45% steps (Abstract) is load-bearing for the diminishing-returns claim, yet the manuscript provides no details on statistical tests, variance across runs, or controls for GRPO reward sparsity/exploration limits on hard items. This leaves open whether plateaus reflect capacity boundaries or optimization artifacts (e.g., noisier exact-match signals on MATH hard tier).
Authors: We agree that the manuscript would benefit from greater statistical detail on this key result. In revision, we will report accuracy variance across three independent random seeds for both the full-dataset and lower-difficulty subset runs, and include paired statistical tests confirming that the ~45% step equivalence holds within confidence intervals. We will also add a dedicated paragraph analyzing GRPO reward sparsity: the exact-match reward (0/1 per problem) produces fewer positive signals on hard-tier MATH items due to longer solution chains and higher baseline error rates, which can constrain policy exploration. However, the same plateau pattern appears on GSM8K (where solutions are shorter), supporting our interpretation of a capacity boundary rather than an artifact unique to MATH. These additions will be placed in the results and discussion sections. revision: yes
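The reward-sparsity mechanism the authors describe can be illustrated directly. A minimal sketch of a binary exact-match reward and the group-relative advantage it induces under GRPO (the `####` answer-extraction convention and plain mean-centering are simplifying assumptions; real GRPO implementations also normalize by the group's reward standard deviation, with a guard for all-equal rewards):

```python
def exact_match_reward(completion: str, gold: str) -> float:
    """Binary exact-match reward of the kind commonly used for GRPO on math:
    1.0 if the extracted final answer equals the gold answer, else 0.0.
    Extraction here is a simplified stand-in (text after the last '####')."""
    answer = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if answer == gold.strip() else 0.0

# On a hard tier, most sampled completions are wrong, so an entire GRPO group
# of rollouts can receive all-zero rewards: every group-relative advantage is
# then zero and the update carries no learning signal.
group = ["... #### 42", "... #### 41", "... #### 40"]
rewards = [exact_match_reward(c, "7") for c in group]
mean = sum(rewards) / len(rewards)
advantages = [r - mean for r in rewards]
print(advantages)  # [0.0, 0.0, 0.0] -- sparse reward, no gradient signal
```

This is why the plateau could in principle be an optimization artifact; the authors' counter-evidence is that the same plateau appears on GSM8K, where reward sparsity is milder.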
-
Referee: [Results on cross-dataset generalization] The cross-dataset generalization claim (GSM8K-trained GRPO exceeding MATH-trained on numeric MATH by 3-5%) requires explicit controls for dataset overlap, numeric-subset definition, and base-model prior competence to be interpretable; without these, the effect size cannot be confidently attributed to training regime rather than confounding factors.
Authors: We accept that these controls should be stated explicitly. The revised manuscript will define the numeric MATH subset as problems whose ground-truth solutions consist solely of numerical answers (no symbolic or expression-based outputs), with the exact filtering code and count of such problems provided in the appendix. We will report the overlap between GSM8K training problems and this numeric MATH subset (approximately 4% shared instances after de-duplication) and confirm that removing overlapping items does not change the reported 3-5% gap. Finally, we will include pre-GRPO baseline accuracies on the numeric subset for each base model size to demonstrate that the differential gains arise after GRPO rather than from differing starting competence. These clarifications will appear in Section 4.3 and the experimental setup. revision: yes
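The proposed controls are mechanical enough to sketch. A minimal illustration of a numeric-answer filter and question-level de-duplication (the regex and the whitespace/case normalization are hypothetical stand-ins for the paper's actual filtering code):

```python
import re

def is_numeric_answer(gold: str) -> bool:
    """Hypothetical filter for the 'numeric MATH subset': keep problems whose
    ground-truth answer is purely numeric (integer or decimal, optionally
    negative), with no symbolic or expression-based output."""
    return re.fullmatch(r"-?\d+(\.\d+)?", gold.strip()) is not None

def remove_overlap(eval_set, train_questions):
    """Drop eval items whose question text also appears in the training set,
    after whitespace/case normalization (a simple de-duplication stand-in)."""
    norm = {" ".join(q.lower().split()) for q in train_questions}
    return [item for item in eval_set
            if " ".join(item["question"].lower().split()) not in norm]

math_eval = [{"question": "What is 2+2?", "answer": "4"},
             {"question": "Solve x^2=4 for x.", "answer": "x=\\pm 2"}]
numeric_subset = [it for it in math_eval if is_numeric_answer(it["answer"])]
clean = remove_overlap(numeric_subset, ["What is  2+2?"])
print(len(numeric_subset), len(clean))  # 1 0
```

Reporting the subset definition, the overlap rate, and the post-de-duplication gap at this level of precision would make the 3-5% effect interpretable.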
Circularity Check
No circularity: empirical comparisons only
Full rationale
The paper reports direct experimental results from GRPO+LoRA training on GSM8K/MATH with difficulty-stratified splits. The key claim (lower-difficulty subset matches full-dataset accuracy at ~45% steps) is an observed outcome of controlled training runs, not a derivation that reduces to fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are invoked; results rest on accuracy measurements across tiers. Self-citations, if present, are not load-bearing for the central empirical finding.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Difficulty stratification of math problems is well-defined and correlates with model capacity limits.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (relevance: unclear): "training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% training steps, indicating diminishing returns from harder samples"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (relevance: unclear): "capacity boundary: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving"
Reference graph
Works this paper leans on
- [1] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv 2023. URL: https://arxiv.org/abs/2305.18290
- [2] Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023. URL: https://arxiv.org/abs/2307.09288
- [3] Emergent Abilities of Large Language Models. arXiv 2022. URL: https://arxiv.org/abs/2206.07682
discussion (0)