SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3
The pith
Curating expert-annotated instruction data for verifiable rewards extends reinforcement learning to general reasoning tasks in language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SUPERNOVA shows that instruction-tuning datasets containing expert-annotated ground-truth answers encode reasoning patterns that can be adapted into high-quality verifiable rewards for RLVR. Source task selection proves non-trivial: selecting source tasks by their performance on individual target tasks outperforms selecting by overall average performance. Task mixing strategies and synthetic interventions for data quality also affect outcomes. Models trained on the resulting data surpass baselines including Qwen3.5 on challenging general reasoning benchmarks, achieving up to a 52.8 percent relative improvement on BBEH.
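For orientation on how a figure like 52.8 percent is computed, a minimal sketch; the accuracies below are hypothetical, picked only to reproduce the headline number, since the material here does not report the underlying per-benchmark scores.

```python
# Hypothetical accuracies, chosen only to reproduce the headline figure;
# the review text does not give the underlying per-benchmark scores.
base_acc = 12.5      # baseline BBEH accuracy (%)
trained_acc = 19.1   # SUPERNOVA-trained accuracy (%)

# Relative improvement as conventionally reported: gain over the baseline.
rel_gain = (trained_acc - base_acc) / base_acc * 100
print(f"{rel_gain:.1f}%")  # -> 52.8%
```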
What carries the argument
SUPERNOVA, a data curation framework that examines source task selection, task mixing strategies, and synthetic interventions to convert natural instructions into RLVR training data.
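A minimal sketch of what such a framework could look like as code, under stated assumptions: the names (`curate_rlvr_data`, `validate`, `rewrite`), the top-k selection cutoff, and uniform mixing are illustrative choices, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    prompt: str
    answer: str  # expert-annotated ground truth, reused as the verifiable target

@dataclass
class Task:
    name: str
    examples: list = field(default_factory=list)

def rewrite(prompt: str) -> str:
    # Placeholder for a synthetic intervention (e.g., an LLM paraphrase of
    # the problem statement that leaves the annotated answer unchanged).
    return prompt

def curate_rlvr_data(source_tasks, target_task, validate, k=5):
    """Illustrative three-stage curation mirroring the factors the paper
    studies: (i) source task selection, (ii) task mixing, (iii) synthetic
    interventions. `validate(task, target)` is assumed to score a source
    task on a held-out split of the target task."""
    # (i) select source tasks by per-target validation performance
    ranked = sorted(source_tasks, key=lambda t: validate(t, target_task), reverse=True)
    selected = ranked[:k]  # top-k cutoff is an assumed choice
    # (ii) mix the selected tasks (uniform mixing assumed here)
    mixed = [ex for task in selected for ex in task.examples]
    # (iii) intervene on problem statements only, preserving ground truth
    return [Example(rewrite(ex.prompt), ex.answer) for ex in mixed]
```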
If this is right
- Source task selection tailored to individual target tasks yields better downstream results than selection based on average performance across tasks.
- Synthetic interventions improve the quality of adapted data for use in verifiable-reward training.
- Principled curation of human-annotated resources enables RLVR to address general reasoning skills beyond formal domains.
- Models trained under this approach deliver measurable gains on benchmarks requiring causal inference and temporal understanding.
- The 100-plus controlled experiments isolate the impact of data design choices from other training variables.
Where Pith is reading between the lines
- Similar curation principles could be tested on other human-annotated resources to extend RLVR into domains with sparse verifiable signals.
- The findings suggest that data design choices may interact with model scale in ways that warrant separate scaling experiments.
- Automated versions of the per-target task selection process could reduce the manual effort needed to apply the framework.
- The work implies that future RLVR pipelines might benefit from hybrid human-synthetic data pipelines rather than purely synthetic generation.
- Extensions might explore whether the same curation logic applies when the base model has already undergone extensive instruction tuning.
Load-bearing premise
That expert-annotated instruction datasets contain reasoning patterns that can be reliably converted into verifiable reward signals without losing their effectiveness for general tasks.
What would settle it
If models trained on SUPERNOVA-curated data failed to outperform, or underperformed, matched baselines on BBEH, ZebraLogic, and MMLU-Pro, the central effectiveness claim would be falsified.
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SUPERNOVA, a data curation framework that adapts expert-annotated instruction-tuning datasets into verifiable rewards for RLVR to elicit general reasoning capabilities (causal inference, temporal understanding) in LLMs. It reports results from over 100 controlled RL experiments examining source task selection, task mixing strategies, and synthetic interventions, concluding that per-target performance-based task selection outperforms average-performance selection and that SUPERNOVA-trained models achieve relative gains of up to 52.8% on BBEH (plus improvements on Zebralogic and MMLU-Pro) over strong baselines such as Qwen3.5.
Significance. If the reported gains prove robust under full experimental controls, the work would offer concrete, actionable guidance for extending RLVR from formal domains to general reasoning by leveraging existing human-annotated resources, with the open-sourced code and data providing a useful starting point for the community.
Major comments (3)
- [Experimental section] Experimental section (description of the 100+ RL runs): the manuscript states that experiments are 'controlled' and isolate the effects of source task selection, mixing, and synthetic interventions, yet supplies no complete hyperparameter table, multi-seed variance statistics, or explicit verification that reward functions (exact match vs. LLM judge) and other non-data variables remained identical across all compared configurations. Without these, performance deltas cannot be confidently attributed to the three studied factors.
- [Results on BBEH] Results on BBEH (and related benchmark tables): the headline relative improvement of up to 52.8% is presented without accompanying statistical significance tests, standard-error bars, or precise baseline implementation details (e.g., whether Qwen3.5 was instruction-tuned with the same RLVR setup or merely prompted). This leaves open the possibility that gains arise from unaccounted confounds rather than the proposed data-curation principles.
- [Task-selection ablation] Task-selection ablation: the claim that 'selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance' requires clarification on how per-target performance was measured (on which split, with which metric) and whether selection was performed on held-out data to prevent leakage into the final evaluation.
Minor comments (2)
- [Abstract] Abstract: the acronym RLVR is used before its expansion ('Reinforcement Learning with Verifiable Rewards') is given; a parenthetical expansion on first use would improve readability.
- [Data-curation description] Data-curation description: the precise mapping from expert-annotated ground-truth answers to automatic verifiable reward functions is described at a high level; a short pseudocode or explicit example for one task type would clarify how 'rich reasoning patterns' are preserved versus introducing shortcuts (one possible shape of such an example is sketched below).
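For illustration, the mapping the referee asks about could be as simple as a normalized exact-match verifier over the expert annotation. This is a hedged sketch, not the paper's confirmed implementation; the function names and the "The answer is" extraction convention are assumptions.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, trim, drop punctuation, and collapse whitespace so that
    surface variants of the same answer compare equal."""
    text = text.strip().lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)

def exact_match_reward(model_output: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 iff the model's final answer matches
    the expert annotation after normalization. Assumes the model is
    prompted to emit its final answer after 'The answer is'."""
    m = re.search(r"the answer is\s*(.+)", model_output, flags=re.IGNORECASE)
    prediction = m.group(1) if m else model_output
    return float(normalize(prediction) == normalize(ground_truth))

# e.g. exact_match_reward("...so The answer is (B).", "(B)") -> 1.0
```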
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor and clarity that we will address in the revision. Below we respond point-by-point to the major comments.
Point-by-point responses
Referee: [Experimental section] Experimental section (description of the 100+ RL runs): the manuscript states that experiments are 'controlled' and isolate the effects of source task selection, mixing, and synthetic interventions, yet supplies no complete hyperparameter table, multi-seed variance statistics, or explicit verification that reward functions (exact match vs. LLM judge) and other non-data variables remained identical across all compared configurations. Without these, performance deltas cannot be confidently attributed to the three studied factors.
Authors: We agree that additional documentation will strengthen the experimental section. In the revised manuscript we will add a complete hyperparameter table in the appendix covering all RL runs. All original experiments used a fixed random seed for reproducibility; we will also report variance statistics from a subset of multi-seed replications (three seeds) for the key configurations. We will insert explicit statements confirming that reward functions (exact match versus LLM judge) and all non-data variables were held identical across compared runs, consistent with the controlled design described in the paper. These additions will make the attribution of performance differences to the studied factors fully transparent. revision: partial
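One way to make "held identical" operationally checkable: freeze every non-data variable in a single config and vary only the curation factors. A minimal sketch under assumed field names and values, not the authors' actual training stack.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RLConfig:
    # Non-data variables, held identical across all compared runs.
    lr: float = 1e-6
    steps: int = 250
    seed: int = 0
    reward: str = "exact_match"  # never silently switched to "llm_judge"
    # Data-side factors under study.
    source_tasks: tuple = ()
    mixing: str = "uniform"
    intervention: str = "none"

base = RLConfig()
# Only the studied data factors differ between compared configurations:
runs = [
    replace(base, source_tasks=("qa", "nli"), intervention="paraphrase"),
    replace(base, source_tasks=("qa", "nli"), intervention="none"),
]
# Any accidental drift in non-data fields is then detectable by assertion:
for cfg in runs:
    assert (cfg.lr, cfg.steps, cfg.seed, cfg.reward) == \
           (base.lr, base.steps, base.seed, base.reward)
```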
Referee: [Results on BBEH] Results on BBEH (and related benchmark tables): the headline relative improvement of up to 52.8% is presented without accompanying statistical significance tests, standard-error bars, or precise baseline implementation details (e.g., whether Qwen3.5 was instruction-tuned with the same RLVR setup or merely prompted). This leaves open the possibility that gains arise from unaccounted confounds rather than the proposed data-curation principles.
Authors: We will strengthen the results presentation by adding standard-error bars and statistical significance tests (paired t-tests across seeds) to the BBEH and related benchmark tables. For the baseline, Qwen3.5 was trained with the identical RLVR procedure and hyperparameters as the SUPERNOVA models (not merely prompted); we will clarify this explicitly in the text and provide the corresponding implementation details. These revisions will reduce the possibility of unaccounted confounds and allow readers to assess the robustness of the reported relative gains. revision: yes
Referee: [Task-selection ablation] Task-selection ablation: the claim that 'selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance' requires clarification on how per-target performance was measured (on which split, with which metric) and whether selection was performed on held-out data to prevent leakage into the final evaluation.
Authors: We will expand the task-selection ablation section to provide the requested details. Per-target performance was measured on a held-out validation split (separate from all test benchmarks) using exact-match accuracy as the metric. Task selection was performed exclusively on this validation split before any final evaluation, thereby preventing leakage. The revised text will include the exact split sizes, the validation metric, and the full selection procedure to make the ablation fully reproducible and free of data leakage concerns. revision: yes
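A sketch of the two selection strategies being contrasted, assuming a table of exact-match validation accuracies; the function names, top-k cutoff, and toy numbers are illustrative only. The toy table shows why the strategies can disagree: a source task that is mediocre on average can still be the best match for one particular target.

```python
def select_per_target(val_acc, target, k=3):
    """val_acc[source][target] = exact-match accuracy of a model trained on
    `source`, measured on the held-out validation split of `target`.
    Per-target selection ranks sources for this one target."""
    return sorted(val_acc, key=lambda s: val_acc[s][target], reverse=True)[:k]

def select_by_average(val_acc, targets, k=3):
    """Baseline strategy: rank sources by mean validation accuracy across
    all targets, ignoring per-target fit."""
    avg = {s: sum(val_acc[s][t] for t in targets) / len(targets) for s in val_acc}
    return sorted(avg, key=avg.get, reverse=True)[:k]

# Toy numbers (hypothetical): 'coref' is mediocre on average but best for
# 'bbeh', which is exactly what per-target selection is meant to catch.
val_acc = {
    "coref": {"bbeh": 0.41, "zebra": 0.22},
    "arith": {"bbeh": 0.25, "zebra": 0.48},
    "story": {"bbeh": 0.30, "zebra": 0.31},
}
print(select_per_target(val_acc, "bbeh", k=1))             # ['coref']
print(select_by_average(val_acc, ["bbeh", "zebra"], k=1))  # ['arith']
```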
Circularity Check
No circularity: purely empirical results from controlled experiments
full rationale
The paper reports outcomes from 100+ RL training runs on curated instruction data, with performance measured on external benchmarks (BBEH, Zebralogic, MMLU-Pro). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. Claims rest on direct experimental deltas rather than any derivation that reduces to its own inputs by construction. This is self-contained empirical work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: instruction-tuning datasets containing expert-annotated ground-truth answers encode rich reasoning patterns that can be systematically adapted for RLVR.
Reference graph
Works this paper leans on
- [2] Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakhturina, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, et al. Nemotron-CrossThink: Scaling self-learning beyond math reasoning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics. arXiv:2504.01943.
- [3] Hritik Bansal, Devandra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, and Ramakanth Pasunuru. Honeybee: Data recipes for vision-language reasoners. arXiv:2510.12225.
- [4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv:2107.03374.
- [5] Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning. arXiv:2505.16400.
- [6] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv:2407.21783.
- [7] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv:2506.04178.
- [8] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
- [9] Dan Hendrycks et al. Measuring mathematical problem solving with the MATH dataset. arXiv:2103.03874; Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv:2503.24290.
- [10] Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1. January 2025.
- [11] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv:2411.15124.
- [12] Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning. arXiv:2502.01100.
- [13] SynLogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv:2505.19641; Yang Liu, Jiaqi Li, and Zilong Zheng. RuleReasoner: Reinforced rule-based reasoning via domain-aware dynamic sampling. arXiv:2506.08672.
- [14] Ximing Lu, David Acuna, Jaehun Jung, Jian Hu, Di Zhang, Shizhe Diao, Yunheng Zou, Shaokun Zhang, Brandon Cui, Mingjie Liu, Hyunwoo Kim, Prithviraj Ammanabrolu, Jan Kautz, Yi Dong, and Yejin Choi. Golden Goose: A simple trick to synthesize unlimited RLVR tasks from unverifiable internet text. arXiv:2601.22975.
- [15] Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-Reasoner: Advancing LLM reasoning across all domains. arXiv:2505.14652.
- [16] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B. Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332.
- [17] arXiv preprint arXiv:2512.13961.
- [18] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
- [19] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051.
- [20] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
- [21] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv:2109.01652.
- [22] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv:2505.09388.
- [23] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. arXiv:2502.03387.
- [24] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv:2503.14476.
- [25] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv:2504.13837.
- [26] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv:2503.18892.
- [27] Instruction tuning for large language models: A survey. arXiv:2308.10792; Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024. https://huggingface.co/datasets/math-ai/aime24.
- [28] Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv:2503.19633.