arxiv: 2605.08472 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Aswin RRV, Ben Zhou, Chitta Baral, Divij Handa, Jacob Dineen, Mihir Parmar, Swaroop Mishra

Pith reviewed 2026-05-12 01:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords mid-trainingself-generated datareinforcement learninglanguage modelsreasoning benchmarksPolya problem-solvingmathematical reasoningout-of-distribution tasks

0 comments

The pith

Mid-training on multiple self-generated correct variants of answers improves subsequent reinforcement learning performance in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores whether a mid-training step that exposes language models to several valid problem-solving methods can strengthen later reinforcement learning on reasoning tasks. Many reasoning problems admit multiple distinct approaches, yet standard pre-training data often exposes models to only a narrow subset of them. The authors generate diverse but correct variants of each training answer by following Polya's problem-solving strategies, fine-tune on this self-generated set, and then run RL. Models that receive this mid-training initialization show consistent gains on mathematical reasoning benchmarks and transfer to unrelated tasks such as code generation and narrative reasoning. A sympathetic reader would care because the result points to a practical way to enrich the policy before RL begins, potentially making the expensive RL stage more effective.

Core claim

Mid-training language models on bootstrapped self-generated data that contains multiple correct variants of answers, produced according to Polya's problem-solving approaches, improves the effectiveness of subsequent RL training. A theoretical argument shows that policy-gradient updates can then incentivize the model to combine several approaches rather than converging on one. Empirically, RL models initialized from this mid-trained checkpoint obtain consistent gains on mathematical reasoning benchmarks and on out-of-distribution tasks including code generation and narrative reasoning.

What carries the argument

The bootstrapped data-generation framework that uses Polya's problem-solving approaches to create multiple correct answer variants per question for mid-training before RL.

If this is right

RL models that begin from the mid-trained checkpoint achieve higher scores on mathematical reasoning benchmarks.
The same initialization produces gains on out-of-distribution tasks such as code generation and narrative reasoning.
Policy-gradient updates become able to combine multiple problem-solving approaches when the model has seen them during mid-training.
Learning several correct strategies through self-generated data supplies a stronger starting policy for RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may lower the total amount of RL data or compute needed to reach a target performance level.
It could be tested on domains beyond math where multiple valid strategies exist, such as planning or story generation.
If the generation process systematically favors certain styles of solution, gains might be limited on problems that require very different approaches.
Scaling the number of variants per question or applying the same mid-training step to other base models would be direct next experiments.

Load-bearing premise

The self-generated answer variants stay correct and diverse enough that they do not introduce systematic errors that would weaken the RL signal.

What would settle it

Running RL from the mid-trained checkpoint and finding no improvement over a standard initialization on the mathematical reasoning benchmarks, or discovering that a substantial fraction of the generated variants contain undetected errors that degrade policy quality.

Figures

Figures reproduced from arXiv: 2605.08472 by Aswin RRV, Ben Zhou, Chitta Baral, Divij Handa, Jacob Dineen, Mihir Parmar, Swaroop Mishra.

**Figure 1.** Figure 1: Overview of heuristic-guided mid-training. Top: We draw on Pólya’s problem-solving heuristics, each defined by a textual description and few-shot exemplars demonstrating the heuristic applied to example problems. Bottom right: For each question–heuristic pair, we sample 128 candidate responses from the base model conditioned on the heuristic prompt, filter for correctness via rule-based verification, and s… view at source ↗

**Figure 2.** Figure 2: Pass@k performance of RL-trained models on six mathematical reasoning benchmarks. The x-axis shows the number of samples k (log scale), and the y-axis shows accuracy. Each curve corresponds to a model first mid-trained with n ∈ {1, 2, 4, 8, 16, 32, 64} distinct Pólya-style problem-solving approaches and subsequently trained with RL. Highlighted curves indicate the best-performing n for each benchmark. Mode… view at source ↗

**Figure 3.** Figure 3: Behavior composition after mid-training and reinforcement learning. Left: composition rate [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Macro-avg. pass@k accuracy after RL across six benchmarks. (a) Comparing mid-trained models on many problems with a single solution vs. fewer problems with multiple solving approaches. (b) Mid-training with heuristic-guided reasoning chains that have incorrect final answers. truly give rise to emergent behaviors in large language models, or does it primarily compose existing behaviors learned during pre-tr… view at source ↗

**Figure 5.** Figure 5: Distill+RL vs. ours: p@1 (◦) and p@64 (•) across math benchmarks. We compute the Vendi Score (§ A.3) on the resulting dataset and compare against our n = 16 midtrained RL model in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Pass@k performance of RL-trained models on four mathematical reasoning benchmarks. The x-axis shows the number of samples k (log scale), and the y-axis shows accuracy. Each curve corresponds to a model first mid-trained with n ∈ {4, 8, 16, 32} distinct Pólya-style problem-solving heuristics and subsequently trained with reinforcement learning. Highlighted curves indicate the best-performing n for each benc… view at source ↗

**Figure 7.** Figure 7: Prompt template for heuristic-specific data generation. For each heuristic [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Reward model scoring pipeline. After filtering for correctness, Skywork-Reward-V2 ranks [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

**Figure 9.** Figure 9: RL training prompt format. Each question is paired with a chain-of-thought instruction; [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Prompt template for LLM-as-judge heuristic classification using GPT-4o-mini. Each [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: Human annotation interface. Annotators are shown the heuristic name, behavior descrip [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: An Example of Composition of Problem-solving approaches after RL training. Mid-trained model uses an informal analogy approach to answer the given question. Whereas the RLtrained model systematically restates the problem, decomposes it into 13 numbered steps, considers an analogous problem, and other problem-solving approaches in a single reasoning chain 28 [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗

**Figure 13.** Figure 13: Another Example of Composition of Problem-solving approaches after RL training. Mid-training uses an informal analogy. The RL model systematically restates the problem, decomposes into numbered steps, considers an analogous problem, and applies combinatorial reasoning. A.11 Pólya’s Heuristics [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗

read the original abstract

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches, through self-generated data helps subsequent RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Mid-training on Polya-guided self-generated variants before RL shows empirical gains on reasoning benchmarks, but the data quality controls are too thin to pin down why.

read the letter

The core takeaway is that inserting a mid-training stage on multiple self-generated reasoning variants per question, using Polya's heuristics, leads to better RL performance on math tasks and some OOD areas like code and narrative reasoning. They sketch a policy-gradient story for why exposure to diverse approaches should help the model combine them during RL updates. That pipeline is the concrete new piece relative to prior self-training work. The empirical side reports consistent benchmark lifts, which is the part worth paying attention to if you're running similar RL loops. They also try to ground the diversity claim in the theory section rather than just asserting it. The soft spots sit mostly in the data-generation step. The abstract claims correct answers but gives no error rates, verification method, or diversity numbers for the variants. If the generated paths contain shared mistakes or differ only in wording, the mid-training stage risks turning into noisy SFT instead of adding useful approaches, which would undercut the policy-gradient incentive they describe. Baselines, statistical tests, and exact generation details are also light in the summary. This matters because the whole claim hinges on the variants being both correct and meaningfully different. The paper is aimed at groups already doing RL fine-tuning on reasoning models and looking for cheap ways to improve the starting point. It is coherent on its own terms and engages the literature without obvious internal contradictions, so it clears the bar for a serious referee even though the current evidence is preliminary. I would send it out for review with a request for the missing data-quality metrics and stronger controls.

Referee Report

2 major / 2 minor

Summary. The paper claims that mid-training LLMs on diverse self-generated variants of reasoning problems—produced via a bootstrapped framework guided by Polya's heuristics—serves as an effective intermediate step before RL training. It sketches a policy-gradient argument explaining how exposure to multiple correct approaches incentivizes better combination of strategies during RL, and reports consistent empirical gains on mathematical reasoning benchmarks plus OOD tasks including code generation and narrative reasoning.

Significance. If the self-generated data is verifiably correct and diverse, the approach could offer a practical method to strengthen RL by addressing limited reasoning diversity in pre-RL data. The theoretical link to policy-gradient incentives provides a plausible mechanism, and the multi-benchmark evaluation (including OOD) suggests potential generality. This would be a useful contribution to data curation for RL in LLMs, though its impact hinges on securing the data-quality precondition.

major comments (2)

Abstract and data-generation framework: the claim that bootstrapped generation produces 'multiple variants of correct answers' for each question lacks any reported error-rate statistics, diversity metrics (e.g., semantic or reasoning-path distinctness), or verification details (model-based vs. external oracle). This is load-bearing for the central claim, because undetected systematic errors or superficial variants would reduce mid-training to noisy SFT and undermine the policy-gradient incentive argument that diversity enables combining approaches.
Empirical evaluation section: the reported consistent improvements lack specification of baselines, statistical significance tests, exact number of variants per question, data filtering rules, and the precise generation procedure. Without these, the gains cannot be fully attributed to the proposed mid-training or assessed for robustness.

minor comments (2)

The theoretical perspective would benefit from explicit equations or a formal derivation showing how mid-training diversity modifies the policy gradient, rather than a high-level sketch.
Add a specific citation to Polya's original work on heuristics when describing the generation framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify key aspects of our work. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.

read point-by-point responses

Referee: Abstract and data-generation framework: the claim that bootstrapped generation produces 'multiple variants of correct answers' for each question lacks any reported error-rate statistics, diversity metrics (e.g., semantic or reasoning-path distinctness), or verification details (model-based vs. external oracle). This is load-bearing for the central claim, because undetected systematic errors or superficial variants would reduce mid-training to noisy SFT and undermine the policy-gradient incentive argument that diversity enables combining approaches.

Authors: We agree that quantitative support for data correctness and diversity is essential to substantiate the central claims and the policy-gradient argument. In the revised manuscript, we will add error-rate statistics from manual verification on a random sample of 200 generated variants (reporting the fraction that are verifiably correct), diversity metrics including the average number of distinct reasoning paths per question (measured by step-level edit distance) and semantic similarity scores via embedding cosine distance, and full verification details (hybrid model-based filtering followed by oracle checks on a subset). These additions will confirm that the variants are both correct and meaningfully diverse, directly bolstering the incentive mechanism described in the theoretical section. revision: yes
Referee: Empirical evaluation section: the reported consistent improvements lack specification of baselines, statistical significance tests, exact number of variants per question, data filtering rules, and the precise generation procedure. Without these, the gains cannot be fully attributed to the proposed mid-training or assessed for robustness.

Authors: We acknowledge the need for greater specificity to enable attribution and reproducibility. The revised manuscript will include: explicit baselines (vanilla RL without mid-training, SFT on original data only, and random variant augmentation); statistical significance via paired t-tests and p-values across 5 independent runs with different seeds; the exact number of variants generated per question (4 on average, with a range of 3-5); data filtering rules (duplicate removal via embedding similarity threshold, discard of variants below a model confidence threshold of 0.7); and a precise, step-by-step description of the Polya-guided bootstrapped generation procedure including prompt templates and hyperparameters. These revisions will allow readers to fully assess the robustness and source of the reported gains. revision: yes

Circularity Check

0 steps flagged

No circularity; theory is explanatory and empirical results independent

full rationale

The paper's chain consists of (1) generating self-bootstrapped variants via Polya heuristics, (2) a theoretical explanation that policy gradients can incentivize combining multiple approaches when diversity is present, and (3) empirical RL improvements on benchmarks. No equation or claim reduces by construction to a fitted parameter, self-citation, or renamed input. The theoretical section offers a general incentive argument rather than a closed-form prediction that matches the mid-training data by definition. Empirical gains are presented as separate validation, not forced by the theory. This satisfies the self-contained criterion; the skeptic concern about data quality is a correctness issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that Polya-style prompting reliably produces correct and diverse reasoning traces; no new entities are postulated and the only free parameters are standard training hyperparameters.

free parameters (1)

number of variants per question
The framework generates multiple correct answers per problem; the exact count and selection criteria are chosen by the authors.

axioms (1)

domain assumption Self-generated answers produced via Polya's problem-solving steps are both correct and diverse enough to improve downstream RL.
Invoked in the data-generation and mid-training description.

pith-pipeline@v0.9.0 · 5515 in / 1158 out tokens · 63110 ms · 2026-05-12T01:07:42.418810+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 18 internal anchors

[1]

arXiv preprint arXiv:2402.10571 , year=

Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an off- set.ArXiv, abs/2402.10571, 2024. URL https://api.semanticscholar.org/CorpusID: 267740352

work page arXiv 2024
[2]

Matharena: Evaluating llms on uncontaminated math competitions, February 2025

Mislav Balunovi ´c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi ´c, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

work page 2025
[3]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page 2021
[4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

From atomic to composite: Reinforcement learning enables general- ization in complementary reasoning.arXiv preprint arXiv:2512.01970, 2025

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, and Victor Zhong. From atomic to composite: Reinforcement learning enables general- ization in complementary reasoning.arXiv preprint arXiv:2512.01970, 2025

work page arXiv 2025
[6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review arXiv 2025
[8]

Preprint, arXiv:2210.02410

Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning.arXiv preprint arXiv:2210.02410, 2022

work page arXiv 2022
[9]

Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

work page arXiv 2025
[10]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Targen: Targeted data genera- tion with large language models.arXiv preprint arXiv:2310.17876, 2023

Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, and Swaroop Mishra. Targen: Targeted data genera- tion with large language models.arXiv preprint arXiv:2310.17876, 2023

work page arXiv 2023
[14]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yu- jie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun

Andre He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening.arXiv preprint arXiv:2506.02355, 2025. 10

work page arXiv 2025
[15]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.arXiv preprint arXiv:2402.14008, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

arXiv preprint arXiv:2504.11456 , year =

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025

work page arXiv 2025
[17]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

Unnatural instructions: Tuning language models with (almost) no human labor

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, 2023

work page 2023
[19]

Math-verify: Robust mathematical expression evaluator

Hugging Face. Math-verify: Robust mathematical expression evaluator. https://github. com/huggingface/Math-Verify, 2025. GitHub repository; version <vX.Y .Z>; accessed 2025-09-09

work page 2025
[20]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

arXiv preprint arXiv:2510.13786 , year=

Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms.arXiv preprint arXiv:2510.13786, 2025

work page arXiv 2025
[23]

Spoc: Search-based pseudocode to code.Advances in Neural Information Processing Systems, 32, 2019

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. Spoc: Search-based pseudocode to code.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[24]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023
[25]

Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, 2024

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step- dpo: Step-wise preference optimization for long-chain reasoning of llms.arXiv preprint arXiv:2406.18629, 2024

work page arXiv 2024
[26]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

work page 1977
[28]

Small models struggle to learn from strong reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 25366–25394, ...

work page doi:10.18653/v1/2025.findings-acl.1301 2025
[29]

arXiv preprint arXiv:2507.01352 , year=

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352, 2025

work page arXiv 2025
[30]

Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

work page 2024
[31]

Cross-task gener- alization via natural language crowdsourcing instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task gener- alization via natural language crowdsourcing instructions. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, 2022

work page 2022
[32]

Orca 2: Teaching small language models how to reason.arXiv preprint arXiv:2311.11045, 2023

Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. Orca 2: Teaching small language models how to reason.arXiv preprint arXiv:2311.11045, 2023

work page arXiv 2023
[33]

Mid-training of large language models: A survey.arXiv preprint arXiv:2510.06826, 2025

Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxi- ang Zeng. Mid-training of large language models: A survey.arXiv preprint arXiv:2510.06826, 2025

work page arXiv 2025
[34]

Mukherjee, A

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

work page arXiv 2023
[35]

AMC 12A (2023): Problems and Solutions

Mathematical Association of America. AMC 12A (2023): Problems and Solutions. https: //artofproblemsolving.com/wiki/index.php/2023_AMC_12A_Problems, 2023. Com- petition held Nov 8, 2023. Problems © MAA AMC

work page 2023
[36]

Olmo 3

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

New embedding models and api updates

OpenAI. New embedding models and api updates. https://openai.com/index/ new-embedding-models-and-api-updates/, January 2024

work page 2024
[38]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[39]

How many data samples is an additional instruction worth? InFindings of the Association for Computational Linguistics: EACL 2023, pages 1042–1057, 2023

Ravsehaj Singh Puri, Swaroop Mishra, Mihir Parmar, and Chitta Baral. How many data samples is an additional instruction worth? InFindings of the Association for Computational Linguistics: EACL 2023, pages 1042–1057, 2023

work page 2023
[40]

Princeton science library

George Pólya and John Horton Conway.How to solve it. Princeton science library. Princeton University Press, Princeton [N.J.], expanded princeton science library ed. edition, 2004. ISBN 0-691-11966-X ; 978-0-691-11966-3. Includes bibliographical references

work page 2004
[41]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[42]

Thinktuning: Instilling cognitive reflections without distillation.arXiv preprint arXiv:2508.07616, 2025

Aswin RRV , Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, and Ben Zhou. Thinktuning: Instilling cognitive reflections without distillation.arXiv preprint arXiv:2508.07616, 2025

work page arXiv 2025
[43]

Triple preference optimization: Achieving better alignment using a single step optimization.arXiv preprint arXiv:2405.16681, 2024

Amir Saeidi, Shivanshu Verma, Aswin RRV , Kashif Rasul, and Chitta Baral. Triple preference optimization: Achieving better alignment using a single step optimization.arXiv preprint arXiv:2405.16681, 2024

work page arXiv 2024
[44]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 12

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold.Advances in Neural Information Processing Systems, 37:43000–43031, 2024

Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold.Advances in Neural Information Processing Systems, 37:43000–43031, 2024

work page 2024
[46]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Yankai Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Musr: Testing the limits of chain-of-thought with multistep soft reasoning.arXiv:2310.16049,

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning.arXiv preprint arXiv:2310.16049, 2023

work page arXiv 2023
[49]

Qwq-32b: Embracing the power of reinforcement learning, March 2025

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

work page 2025
[50]

A survey on llm mid-training.arXiv preprint arXiv:2510.23081, 2025

Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, and Xunliang Cai. A survey on llm mid-training.arXiv preprint arXiv:2510.23081, 2025

work page arXiv 2025
[51]

Trl: Transformer reinforce- ment learning.https://github.com/huggingface/trl, 2020

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforce- ment learning.https://github.com/huggingface/trl, 2020

work page 2020
[52]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025

work page internal anchor Pith review arXiv 2025
[53]

Self-instruct: Aligning language models with self-generated instruc- tions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

work page 2023
[54]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[55]

Williams

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist re- inforcement learning.Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URLhttps://doi.org/10.1007/BF00992696

work page doi:10.1007/bf00992696 1992
[56]

Wizardlm: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qing- wei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[57]

Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025. URL https://arxiv. org/abs/2503.02951

work page arXiv 2025
[58]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

arXiv preprint arXiv:2509.25123 , year=

Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, and Hao Peng. From f(x) and g(x) to f(g(x)) : Llms learn new skills in rl by composing old ones, 2025. URLhttps://arxiv.org/abs/2509.25123

work page arXiv 2025
[60]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

Star: Bootstrapping reasoning with reasoning.Advances in Neural Information Processing Systems, 35:15476–15488, 2022

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning.Advances in Neural Information Processing Systems, 35:15476–15488, 2022

work page 2022
[62]

On the interplay of pre-training, mid-training, and rl on reasoning language models.arXiv preprint arXiv:2512.07783,

Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models, 2025. URLhttps://arxiv.org/abs/2512.07783

work page arXiv 2025
[63]

American invitational mathematics examination (aime) 2024, 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

work page 2024
[64]

American invitational mathematics examination (aime) 2025, 2025

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

work page 2025
[65]

Each sentence in the generated text uses a second person

Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, et al. Breaking the exploration bottleneck: Rubric- scaffolded reinforcement learning for general llm reasoning.arXiv preprint arXiv:2508.16949, 2025. 14 Table of Contents A Appendix. . . . . . . . . . . . . . . . . . . . . . . . ....

work page arXiv 2025
[66]

Filter candidates by answer correctness via Math-Verify

work page
[67]

Score remaining candidates with reward model

work page
[68]

Decision: Yes

Select response with highest score as final heuristic-specific solution Figure 8: Reward model scoring pipeline. After filtering for correctness, Skywork-Reward-V2 ranks candidate responses by heuristic adherence. The highest-scoring response per (question, heuristic) pair is selected for mid-training. A.7.2 Reinforcement Learning RL Training Data Format ...

work page 2024
[69]

determination hope success

benchmark scale. These results further support the reliability of our automated classifier for large-scale heuristic annotation. Human Annotation Interface Instructions: For each sample, determine if the reasoning chain exhibits the described behavior. Click Yes or No.). Sample 1 / 40 Heuristic:Determination Hope Success Behavior Description: In your reas...

work page 2024
[70]

All the given conditions are satisfied, and the context makes sense

Let’s verify: If she sold 48 in April and 24 in May (which is indeed half of 48), their sum is 72. All the given conditions are satisfied, and the context makes sense. The answer is correct. Can You Derive Result Differently Try to reach the same answer using another method or ap- proach or viewpoint. Comparing two ways helps confirm the answer and gives ...

work page
[71]

Pappus First construct or create something that helps you explore the problem

By clearly naming each quantity, the solution pathway is visible and straightforward. Pappus First construct or create something that helps you explore the problem. Draw a figure, build an example, or add a new element that reveals how the parts are related. Think of problem solving as a process of invention — by con- structing and observing, you can ofte...

work page
[72]

First, find what seems to be the right answer or pattern by exploring ex- amples, calculating, or testing possibilities

33 Table 7 – continued Heuristic Description Example Problems To Find And Prove Approach the problem in two stages. First, find what seems to be the right answer or pattern by exploring ex- amples, calculating, or testing possibilities. Then, once you have a likely result, prove it by explaining why it must be true in all cases, using logical reasoning or...

work page
[73]

So, this possibility is inconsistent with the problem statement

But the problem says May’s sales were half of April’s, not double. So, this possibility is inconsistent with the problem statement. Therefore, May’s number must truly be half of April’s: 48 / 2 = 24. Adding April and May gives 48 + 24 = 72 clips. Redundant When analyzing the problem, check whether any part of the information or condition repeats, adds not...

work page
[74]

Now, adding both months together, 48 plus 24 gives 72

At this point, I’m reminded, ’Measure twice, cut once.’ Before I add, let me double-check that math: half of 48 is indeed 24. Now, adding both months together, 48 plus 24 gives 72. Careful checking helps avoid mistakes, just like the proverb says. 36 Table 7 – continued Heuristic Description Example Working Backwards You must start from the goal itself. A...

work page