pith. machine review for the scientific record.

arxiv: 2604.18381 · v1 · submitted 2026-04-20 · 💻 cs.AI · cs.LG

Recognition: unknown

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:51 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords RLVR · low data regimes · procedural datasets · sample efficiency · small language models · reasoning generalization · task complexity · fine-tuning

The pith

Small language models gain up to 5x sample efficiency in low-data RLVR when trained on mixed-complexity procedural data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies Reinforcement Learning with Verifiable Rewards on small language models under tight data and compute constraints. It creates three procedural datasets for number counting, graph reasoning, and spatial reasoning that let researchers vary size, diversity, and complexity at will. Experiments show that mixed-complexity training produces the largest gains, reaching five times the sample efficiency of easy-only training, while models trained only on simple tasks still succeed on harder versions. These patterns matter because they point to concrete ways to improve reasoning capabilities without needing massive annotated datasets.
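
To ground the acronym: RLVR replaces a learned reward model with a programmatic check that the model's final answer matches a known ground truth. The sketch below is an editorial illustration of such a verifiable reward, not the paper's implementation; the answer-extraction pattern and the binary reward values are assumptions (Figure 5 suggests the paper's actual reward has several components).

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Minimal verifiable reward: 1.0 iff the extracted final answer
    matches the ground truth, else 0.0. This sketch keeps only the
    correctness component."""
    # Assumption: the model is prompted to end with "Answer: <value>".
    match = re.search(r"Answer:\s*(.+?)\s*$", completion, re.MULTILINE)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```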

Core claim

Using three new procedural datasets for number counting, graph reasoning, and spatial reasoning, the work shows that small language models trained via RLVR on mixed-complexity data achieve superior performance in low-data regimes compared to uniform complexity training. Specifically, low-complexity training generalizes to high-complexity evaluation, and mixed datasets provide up to 5x the sample efficiency of easy-only datasets. The use of procedurally generated data enables detailed control and analysis of how dataset size, diversity, and complexity influence fine-tuning outcomes across these tasks.

What carries the argument

Procedural data generators that produce reasoning tasks at adjustable complexity levels, allowing controlled measurement of how data composition affects RLVR outcomes in scarce-data conditions.
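
As a concrete, entirely hypothetical illustration of what such a generator could look like for the counting task: here complexity is parameterized by list length, and the item vocabulary, scaling rule, and field names are editorial assumptions, not the paper's design.

```python
import random

def generate_counting_task(complexity: int, rng: random.Random) -> dict:
    """Hypothetical procedural generator: a counting question whose
    difficulty scales with `complexity` via longer item sequences.
    The verifiable ground-truth answer is computed alongside the prompt."""
    items = ["apple", "pear", "fig", "plum", "kiwi"]
    target = rng.choice(items)
    sequence = [rng.choice(items) for _ in range(5 + 5 * complexity)]
    question = (f"How many times does '{target}' appear in this list: "
                f"{', '.join(sequence)}?")
    return {"prompt": question,
            "answer": str(sequence.count(target)),
            "complexity": complexity}

rng = random.Random(0)
easy_example = generate_counting_task(1, rng)   # 10-item list
hard_example = generate_counting_task(5, rng)   # 30-item list
```

Because prompt, answer, and complexity label come from the same deterministic process, dataset size, diversity, and difficulty can each be dialed independently, which is exactly the property the argument leans on.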

If this is right

  • Models trained only on lower-complexity tasks can solve higher-complexity versions of the same problem types.
  • Mixing task complexities during training maximizes performance per example more effectively than using only easy or only hard tasks (a sampling sketch follows this list).
  • Procedural generation supplies a scalable route to diverse training data with known properties, reducing dependence on large human-annotated sets.
  • These efficiency patterns appear consistently across counting, graph, and spatial reasoning, suggesting broader applicability to other controllable reasoning domains.
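
A hedged sketch of the mixing point above, reusing the hypothetical counting generator from earlier: assemble a fixed example budget from several difficulty levels rather than one. The bin labels and the near-uniform weights are placeholders, not values reported in the paper.

```python
import random

def mixed_complexity_pool(generators, counts, rng):
    """Build a training pool that mixes difficulty levels.
    `generators` maps a difficulty label to a task-making callable;
    `counts` maps the same labels to per-level example budgets."""
    pool = [gen(rng)
            for level, gen in generators.items()
            for _ in range(counts[level])]
    rng.shuffle(pool)
    return pool

rng = random.Random(0)
pool = mixed_complexity_pool(
    {"easy":   lambda r: generate_counting_task(1, r),
     "medium": lambda r: generate_counting_task(3, r),
     "hard":   lambda r: generate_counting_task(5, r)},
    {"easy": 34, "medium": 33, "hard": 33},   # a 100-example mixed budget
    rng,
)
```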

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same mixed-complexity strategy could be tested in supervised fine-tuning or other reward methods to stretch limited data budgets.
  • Data curation pipelines for reasoning models might shift toward generating balanced difficulty distributions rather than maximizing average hardness.
  • Validating the results on existing non-procedural benchmarks would clarify how much the controllable generation itself contributes to the observed gains.
  • Extending the approach to larger models or additional diversity axes such as problem format could uncover further efficiency improvements.

Load-bearing premise

The three procedural datasets are representative enough of broader reasoning capabilities that the efficiency gains and easy-to-hard generalization will appear in other tasks or real data.

What would settle it

Run the same low-data RLVR protocol on a new domain, such as arithmetic word problems, and find neither the 5x efficiency advantage for mixed-complexity training nor generalization from low- to high-complexity test sets.
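
If such a replication is run, the 5x figure can be operationalized as a ratio of example budgets needed to reach a fixed target accuracy. The sketch below is an editorial reading of that metric with invented learning-curve numbers, not the paper's evaluation code.

```python
def examples_to_reach(curve, target):
    """`curve` is a list of (num_examples, accuracy) points in increasing
    order of num_examples; return the smallest budget hitting `target`."""
    for n, acc in curve:
        if acc >= target:
            return n
    return None  # target never reached at these budgets

def sample_efficiency_ratio(mixed_curve, easy_curve, target=0.8):
    """> 1 means mixed-complexity training needs fewer examples."""
    n_mixed = examples_to_reach(mixed_curve, target)
    n_easy = examples_to_reach(easy_curve, target)
    if n_mixed is None or n_easy is None:
        return None
    return n_easy / n_mixed

# Illustrative numbers only: mixed hits 80% at 100 examples, easy at 500,
# which would correspond to the paper's reported 5x.
ratio = sample_efficiency_ratio([(100, 0.82), (200, 0.86)],
                                [(100, 0.55), (200, 0.68), (500, 0.81)])
```

A ratio near 1 on the new domain, together with absent easy-to-hard transfer, would count against the paper's central claim.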

Figures

Figures reproduced from arXiv: 2604.18381 by Armin Parchami, Derek Pham, Frederic Sala, Harit Vishwakarma, Justin Bauer, Paroma Varma, Thomas Walshe.

Figure 1
Figure 1. Overall model-based evaluation results across all generated samples for Counting Problems, Graph Reasoning, and Spatial Reasoning. Samples are binned by the share of models that answered correctly: Easy: 67–100%; Medium: 34–66%; Hard: 0–33%. We then curated multiple subsets for downstream use by sampling based on difficulty. The following dataset configurations were c… view at source ↗
Figure 2
Figure 2. Training reward curves across all three datasets (left: easy-only, right: mixed-difficulty). Colors: blue = 100, orange = 200, green = 500 examples. Light shaded lines show training rewards; dark solid lines with diamond markers show validation rewards. view at source ↗
Figure 3
Figure 3. Test accuracy by question difficulty across all three datasets (left: easy-trained, right: mixed-trained). Colors: blue = 100, orange = 200, green = 500 examples. Bars show accuracy on the held-out test set. view at source ↗
Figure 4
Figure 4. Gradient norm over training steps for three counting configurations. Easy-100 exhibits spikes exceeding 850× baseline between steps 150–300, coinciding with the reward collapse in Figure 2a. Easy-200 and Mixed-100 remain stable throughout, supporting a minimum diversity threshold for stable optimization. view at source ↗
Figure 5
Figure 5. Reward component breakdown across training configurations. For counting, Easy-100 correctness collapses after step 150, consistent with gradient norm instability. view at source ↗
Figure 6
Figure 6. Test accuracy on different types of queries in spatial reasoning. Here, AO and AL stand for absolute orientation and absolute location; similarly, RO and RL stand for relative queries. In both easy and mixed settings, we see improvements across all types of queries. The improvements are more pronounced on location and relative orientation queries. Moreover, in the mixed setting, the performance is much bett… view at source ↗
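
The difficulty binning quoted in Figure 1 is mechanical enough to state in code. A minimal sketch, assuming per-sample fractions of reference models that answered correctly; the thresholds come from the caption, everything else is illustrative.

```python
def difficulty_bin(frac_correct: float) -> str:
    """Bin a generated sample by the share of evaluated models that
    answered it correctly, using the thresholds quoted in Figure 1."""
    if frac_correct >= 0.67:
        return "easy"      # 67-100% of models correct
    if frac_correct >= 0.34:
        return "medium"    # 34-66% of models correct
    return "hard"          # 0-33% of models correct
```
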
read the original abstract

Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a controlled empirical study of RLVR fine-tuning for small language models in low-data regimes. Using three novel procedural datasets (number counting, graph reasoning, spatial reasoning) with controllable size, diversity, and complexity, the authors characterize scaling behavior and report that (1) lower-complexity training generalizes to higher-complexity tasks and (2) mixed-complexity training yields the largest gains, including up to 5x sample efficiency relative to easy-only training.

Significance. If the empirical patterns hold beyond the specific generators, the work would supply practical guidance for data curation under compute and annotation constraints and support the development of data scaling laws for RLVR. The procedural generators are a clear methodological strength, enabling fine-grained ablation of size/diversity/complexity that is difficult with static benchmarks.

major comments (3)
  1. [Abstract and §4 (Results)] The central claims—mixed-complexity training providing up to 5x sample efficiency and low-to-high complexity generalization—are demonstrated exclusively on the three author-introduced procedural distributions. No transfer experiments or scaling curves are shown on established reasoning corpora (GSM8K, MATH, or code-generation suites), so it remains possible that the observed efficiency ordering is an artifact of how complexity is parameterized within each generator family.
  2. [§3 (Experimental Setup)] The manuscript does not report the number of random seeds, statistical significance tests, or confidence intervals for the 5x efficiency figure or the generalization results. Without these, it is impossible to determine whether the reported advantages are robust or sensitive to post-hoc dataset splits or hyperparameter choices.
  3. [§2 (Datasets)] While the generators permit controllable complexity, the paper provides no external validation that the chosen complexity metrics (step count, graph statistics, spatial relations) align with LLM reasoning difficulty on out-of-distribution tasks. This weakens the claim that the observed transfer and efficiency patterns constitute a general principle for RLVR data design.
minor comments (2)
  1. [Abstract] 'Low data regimes' should be quantified (e.g., exact sample counts or token budgets) to allow direct comparison with prior RLVR scaling studies.
  2. [Figures] Scaling plots should include error bars and explicit legends distinguishing mixed-, easy-, and hard-only conditions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We provide point-by-point responses to the major comments below, indicating revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §4 (Results)] The central claims—mixed-complexity training providing up to 5x sample efficiency and low-to-high complexity generalization—are demonstrated exclusively on the three author-introduced procedural distributions. No transfer experiments or scaling curves are shown on established reasoning corpora (GSM8K, MATH, or code-generation suites), so it remains possible that the observed efficiency ordering is an artifact of how complexity is parameterized within each generator family.

    Authors: We appreciate the referee's concern regarding the generalizability of our findings. The procedural datasets were specifically designed to allow fine-grained control over complexity, diversity, and size, enabling us to rigorously test hypotheses about RLVR data curation that would be challenging with fixed benchmarks like GSM8K or MATH. Our results demonstrate consistent patterns across three distinct domains (counting, graphs, spatial), supporting the robustness of low-to-high complexity generalization and mixed-complexity benefits. While transfer to standard corpora is desirable, it is beyond the scope of this controlled study focused on low-data regimes. In the revised version, we will add a Limitations section explicitly discussing this scope and suggesting future work on transfer experiments. revision: partial

  2. Referee: [§3 (Experimental Setup)] The manuscript does not report the number of random seeds, statistical significance tests, or confidence intervals for the 5x efficiency figure or the generalization results. Without these, it is impossible to determine whether the reported advantages are robust or sensitive to post-hoc dataset splits or hyperparameter choices.

    Authors: We agree that reporting statistical details is essential for assessing robustness. We will update §3 (Experimental Setup) and the results in §4 to specify the number of random seeds (we used 3 seeds for all experiments), include error bars or confidence intervals in figures, and perform statistical significance tests (paired t-tests) for the key efficiency comparisons. This will be incorporated in the revised manuscript. revision: yes

  3. Referee: [§2 (Datasets)] While the generators permit controllable complexity, the paper provides no external validation that the chosen complexity metrics (step count, graph statistics, spatial relations) align with LLM reasoning difficulty on out-of-distribution tasks. This weakens the claim that the observed transfer and efficiency patterns constitute a general principle for RLVR data design.

    Authors: The complexity metrics were chosen based on established notions of reasoning difficulty in the literature (e.g., number of operations or structural complexity). To address this, we will revise §2 to provide more justification for these metrics, including references to prior work on reasoning complexity, and include an additional analysis showing correlation between our complexity levels and model performance on the procedural tasks themselves. We believe this supports the patterns as indicative of general principles, though we acknowledge broader validation would be beneficial. revision: partial
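
For readers weighing response 2: what the promised significance test would look like in practice. A minimal sketch assuming three seeds per condition with paired per-seed test accuracies; the numbers are invented for illustration, and with only three pairs the test has little statistical power.

```python
from scipy.stats import ttest_rel

# Hypothetical per-seed test accuracies; each pair shares a random seed.
mixed_acc = [0.81, 0.79, 0.84]
easy_acc = [0.62, 0.66, 0.60]

# Paired t-test across seeds, as proposed in the rebuttal.
t_stat, p_value = ttest_rel(mixed_acc, easy_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```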

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on synthetic datasets

full rationale

The paper reports direct experimental results from RLVR fine-tuning of SLMs on three author-generated procedural datasets (number counting, graph reasoning, spatial reasoning). All reported findings—scaling with data size/diversity/complexity, generalization from low- to high-complexity tasks, and up to 5x sample efficiency for mixed-complexity training—are measured performance numbers, not derived predictions or first-principles results. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to define or force the central claims; the work contains no load-bearing derivations that reduce to their own inputs by construction. The study is self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is empirical and relies on standard RLVR assumptions plus the representativeness of the three new procedural datasets; no free parameters, invented entities, or non-standard axioms are introduced.

axioms (2)
  • domain assumption RLVR with verifiable rewards produces measurable improvements in reasoning on the chosen tasks
    Implicit in the experimental design and performance reporting.
  • domain assumption Procedural generation yields datasets whose size, diversity, and complexity can be independently controlled without introducing unintended biases
    Central to the claim that these datasets enable fine-grained evaluation.

pith-pipeline@v0.9.0 · 5554 in / 1354 out tokens · 29138 ms · 2026-05-10T04:51:09.461675+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.

  2. [2]

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

Dang, Q.-A. and Ngo, C. Reinforcement learning for reasoning in small LLMs: What works and what doesn't. arXiv preprint arXiv:2503.16219.

  3. [3]

Automating Benchmark Design

Dsouza, A., Vishwakarma, H., Qi, Z., Bauer, J., Pham, D., Walshe, T., Parchami, A., Sala, F., and Varma, P. Automating benchmark design. arXiv preprint arXiv:2510.25039.

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  5. [5]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

He, Z., Liang, T., Xu, J., Liu, Q., Chen, X., Wang, Y., Song, L., Yu, D., Liang, Z., Wang, W., et al. DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456.

  6. [6]

    Training Compute-Optimal Large Language Models

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

  7. [7]

    Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  8. [8]

The Art of Scaling Reinforcement Learning Compute for LLMs

Khatri, D., Madaan, L., Tiwari, R., Bansal, R., Duvvuri, S. S., Zaheer, M., Dhillon, I. S., Brandfonbrener, D., and Agarwal, R. The art of scaling reinforcement learning compute for LLMs. arXiv preprint arXiv:2510.13786.

  9. [9]

LIMR: Less is More for RL Scaling

Li, X., Zou, H., and Liu, P. LIMR: Less is more for RL scaling. arXiv preprint arXiv:2502.11886.

  10. [10]

Trust, but Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Liu, X., Liang, T., He, Z., Xu, J., Wang, W., He, P., Tu, Z., Mi, H., and Yu, D. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.13445.

  11. [11]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  12. [12]

    Exploring data scaling trends and effects in reinforcement learning from human feedback

Shen, W., Liu, G., Wu, Z., Zhu, R., Yang, Q., Xin, C., Yue, Y., and Yan, L. Exploring data scaling trends and effects in reinforcement learning from human feedback. arXiv preprint arXiv:2503.22230.

  13. [13]

    Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Tan, Z., Geng, H., Yu, X., Zhang, M., Wan, G., Zhou, Y., He, Q., Xue, X., Zhou, H., Fan, Y., Li, Z., Zhang, Z., Zhang, G., Zhang, C., Yin, Z., Torr, P., and Bai, L. Scaling behaviors of LLM reinforcement learning post-training: An empirical study in mathematical reasoning. arXiv preprint arXiv:2509.25300.

  14. [14]

Tina: Tiny Reasoning Models via LoRA

Wang, S., Asilis, J., Akgül, Ö. F., Bilgin, E. B., Liu, O., and Neiswanger, W. Tina: Tiny reasoning models via LoRA. arXiv preprint arXiv:2504.15777, 2025a. Wang, Y., Yang, Q., Zeng, Z., Ren, L., Liu, L., Peng, B., Cheng, H., He, X., Wang, K., Gao, J., et al. Reinforcement learning for reasoning in large language models with one training example. arXi...

  15. [15]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

  16. [16]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471.

  17. [17]

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

Zhang, B., Liu, Z., Cherry, C., and Firat, O. When scaling meets LLM finetuning: The effect of data, model and finetuning method. arXiv preprint arXiv:2402.17193, 2024.