pith. machine review for the scientific record. sign in

arxiv: 2605.08472 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Aswin RRV, Ben Zhou, Chitta Baral, Divij Handa, Jacob Dineen, Mihir Parmar, Swaroop Mishra

Pith reviewed 2026-05-12 01:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords mid-trainingself-generated datareinforcement learninglanguage modelsreasoning benchmarksPolya problem-solvingmathematical reasoningout-of-distribution tasks
0
0 comments X

The pith

Mid-training on multiple self-generated correct variants of answers improves subsequent reinforcement learning performance in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores whether a mid-training step that exposes language models to several valid problem-solving methods can strengthen later reinforcement learning on reasoning tasks. Many reasoning problems admit multiple distinct approaches, yet standard pre-training data often exposes models to only a narrow subset of them. The authors generate diverse but correct variants of each training answer by following Polya's problem-solving strategies, fine-tune on this self-generated set, and then run RL. Models that receive this mid-training initialization show consistent gains on mathematical reasoning benchmarks and transfer to unrelated tasks such as code generation and narrative reasoning. A sympathetic reader would care because the result points to a practical way to enrich the policy before RL begins, potentially making the expensive RL stage more effective.

Core claim

Mid-training language models on bootstrapped self-generated data that contains multiple correct variants of answers, produced according to Polya's problem-solving approaches, improves the effectiveness of subsequent RL training. A theoretical argument shows that policy-gradient updates can then incentivize the model to combine several approaches rather than converging on one. Empirically, RL models initialized from this mid-trained checkpoint obtain consistent gains on mathematical reasoning benchmarks and on out-of-distribution tasks including code generation and narrative reasoning.

What carries the argument

The bootstrapped data-generation framework that uses Polya's problem-solving approaches to create multiple correct answer variants per question for mid-training before RL.

If this is right

  • RL models that begin from the mid-trained checkpoint achieve higher scores on mathematical reasoning benchmarks.
  • The same initialization produces gains on out-of-distribution tasks such as code generation and narrative reasoning.
  • Policy-gradient updates become able to combine multiple problem-solving approaches when the model has seen them during mid-training.
  • Learning several correct strategies through self-generated data supplies a stronger starting policy for RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may lower the total amount of RL data or compute needed to reach a target performance level.
  • It could be tested on domains beyond math where multiple valid strategies exist, such as planning or story generation.
  • If the generation process systematically favors certain styles of solution, gains might be limited on problems that require very different approaches.
  • Scaling the number of variants per question or applying the same mid-training step to other base models would be direct next experiments.

Load-bearing premise

The self-generated answer variants stay correct and diverse enough that they do not introduce systematic errors that would weaken the RL signal.

What would settle it

Running RL from the mid-trained checkpoint and finding no improvement over a standard initialization on the mathematical reasoning benchmarks, or discovering that a substantial fraction of the generated variants contain undetected errors that degrade policy quality.

Figures

Figures reproduced from arXiv: 2605.08472 by Aswin RRV, Ben Zhou, Chitta Baral, Divij Handa, Jacob Dineen, Mihir Parmar, Swaroop Mishra.

Figure 1
Figure 1. Figure 1: Overview of heuristic-guided mid-training. Top: We draw on Pólya’s problem-solving heuristics, each defined by a textual description and few-shot exemplars demonstrating the heuristic applied to example problems. Bottom right: For each question–heuristic pair, we sample 128 candidate responses from the base model conditioned on the heuristic prompt, filter for correctness via rule-based verification, and s… view at source ↗
Figure 2
Figure 2. Figure 2: Pass@k performance of RL-trained models on six mathematical reasoning benchmarks. The x-axis shows the number of samples k (log scale), and the y-axis shows accuracy. Each curve corresponds to a model first mid-trained with n ∈ {1, 2, 4, 8, 16, 32, 64} distinct Pólya-style problem-solving approaches and subsequently trained with RL. Highlighted curves indicate the best-performing n for each benchmark. Mode… view at source ↗
Figure 3
Figure 3. Figure 3: Behavior composition after mid-training and reinforcement learning. Left: composition rate [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Macro-avg. pass@k accuracy after RL across six benchmarks. (a) Comparing mid-trained models on many problems with a single solution vs. fewer problems with multiple solving approaches. (b) Mid-training with heuristic-guided reasoning chains that have incorrect final answers. truly give rise to emergent behaviors in large language models, or does it primarily compose existing behaviors learned during pre-tr… view at source ↗
Figure 5
Figure 5. Figure 5: Distill+RL vs. ours: p@1 (◦) and p@64 (•) across math benchmarks. We compute the Vendi Score (§ A.3) on the result￾ing dataset and compare against our n = 16 mid￾trained RL model in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pass@k performance of RL-trained models on four mathematical reasoning benchmarks. The x-axis shows the number of samples k (log scale), and the y-axis shows accuracy. Each curve corresponds to a model first mid-trained with n ∈ {4, 8, 16, 32} distinct Pólya-style problem-solving heuristics and subsequently trained with reinforcement learning. Highlighted curves indicate the best-performing n for each benc… view at source ↗
Figure 7
Figure 7. Figure 7: Prompt template for heuristic-specific data generation. For each heuristic [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Reward model scoring pipeline. After filtering for correctness, Skywork-Reward-V2 ranks [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: RL training prompt format. Each question is paired with a chain-of-thought instruction; [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt template for LLM-as-judge heuristic classification using GPT-4o-mini. Each [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Human annotation interface. Annotators are shown the heuristic name, behavior descrip [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: An Example of Composition of Problem-solving approaches after RL training. Mid-trained model uses an informal analogy approach to answer the given question. Whereas the RL￾trained model systematically restates the problem, decomposes it into 13 numbered steps, considers an analogous problem, and other problem-solving approaches in a single reasoning chain 28 [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Another Example of Composition of Problem-solving approaches after RL training. Mid-training uses an informal analogy. The RL model systematically restates the problem, decom￾poses into numbered steps, considers an analogous problem, and applies combinatorial reasoning. A.11 Pólya’s Heuristics [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
read the original abstract

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches, through self-generated data helps subsequent RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that mid-training LLMs on diverse self-generated variants of reasoning problems—produced via a bootstrapped framework guided by Polya's heuristics—serves as an effective intermediate step before RL training. It sketches a policy-gradient argument explaining how exposure to multiple correct approaches incentivizes better combination of strategies during RL, and reports consistent empirical gains on mathematical reasoning benchmarks plus OOD tasks including code generation and narrative reasoning.

Significance. If the self-generated data is verifiably correct and diverse, the approach could offer a practical method to strengthen RL by addressing limited reasoning diversity in pre-RL data. The theoretical link to policy-gradient incentives provides a plausible mechanism, and the multi-benchmark evaluation (including OOD) suggests potential generality. This would be a useful contribution to data curation for RL in LLMs, though its impact hinges on securing the data-quality precondition.

major comments (2)
  1. Abstract and data-generation framework: the claim that bootstrapped generation produces 'multiple variants of correct answers' for each question lacks any reported error-rate statistics, diversity metrics (e.g., semantic or reasoning-path distinctness), or verification details (model-based vs. external oracle). This is load-bearing for the central claim, because undetected systematic errors or superficial variants would reduce mid-training to noisy SFT and undermine the policy-gradient incentive argument that diversity enables combining approaches.
  2. Empirical evaluation section: the reported consistent improvements lack specification of baselines, statistical significance tests, exact number of variants per question, data filtering rules, and the precise generation procedure. Without these, the gains cannot be fully attributed to the proposed mid-training or assessed for robustness.
minor comments (2)
  1. The theoretical perspective would benefit from explicit equations or a formal derivation showing how mid-training diversity modifies the policy gradient, rather than a high-level sketch.
  2. Add a specific citation to Polya's original work on heuristics when describing the generation framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify key aspects of our work. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.

read point-by-point responses
  1. Referee: Abstract and data-generation framework: the claim that bootstrapped generation produces 'multiple variants of correct answers' for each question lacks any reported error-rate statistics, diversity metrics (e.g., semantic or reasoning-path distinctness), or verification details (model-based vs. external oracle). This is load-bearing for the central claim, because undetected systematic errors or superficial variants would reduce mid-training to noisy SFT and undermine the policy-gradient incentive argument that diversity enables combining approaches.

    Authors: We agree that quantitative support for data correctness and diversity is essential to substantiate the central claims and the policy-gradient argument. In the revised manuscript, we will add error-rate statistics from manual verification on a random sample of 200 generated variants (reporting the fraction that are verifiably correct), diversity metrics including the average number of distinct reasoning paths per question (measured by step-level edit distance) and semantic similarity scores via embedding cosine distance, and full verification details (hybrid model-based filtering followed by oracle checks on a subset). These additions will confirm that the variants are both correct and meaningfully diverse, directly bolstering the incentive mechanism described in the theoretical section. revision: yes

  2. Referee: Empirical evaluation section: the reported consistent improvements lack specification of baselines, statistical significance tests, exact number of variants per question, data filtering rules, and the precise generation procedure. Without these, the gains cannot be fully attributed to the proposed mid-training or assessed for robustness.

    Authors: We acknowledge the need for greater specificity to enable attribution and reproducibility. The revised manuscript will include: explicit baselines (vanilla RL without mid-training, SFT on original data only, and random variant augmentation); statistical significance via paired t-tests and p-values across 5 independent runs with different seeds; the exact number of variants generated per question (4 on average, with a range of 3-5); data filtering rules (duplicate removal via embedding similarity threshold, discard of variants below a model confidence threshold of 0.7); and a precise, step-by-step description of the Polya-guided bootstrapped generation procedure including prompt templates and hyperparameters. These revisions will allow readers to fully assess the robustness and source of the reported gains. revision: yes

Circularity Check

0 steps flagged

No circularity; theory is explanatory and empirical results independent

full rationale

The paper's chain consists of (1) generating self-bootstrapped variants via Polya heuristics, (2) a theoretical explanation that policy gradients can incentivize combining multiple approaches when diversity is present, and (3) empirical RL improvements on benchmarks. No equation or claim reduces by construction to a fitted parameter, self-citation, or renamed input. The theoretical section offers a general incentive argument rather than a closed-form prediction that matches the mid-training data by definition. Empirical gains are presented as separate validation, not forced by the theory. This satisfies the self-contained criterion; the skeptic concern about data quality is a correctness issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that Polya-style prompting reliably produces correct and diverse reasoning traces; no new entities are postulated and the only free parameters are standard training hyperparameters.

free parameters (1)
  • number of variants per question
    The framework generates multiple correct answers per problem; the exact count and selection criteria are chosen by the authors.
axioms (1)
  • domain assumption Self-generated answers produced via Polya's problem-solving steps are both correct and diverse enough to improve downstream RL.
    Invoked in the data-generation and mid-training description.

pith-pipeline@v0.9.0 · 5515 in / 1158 out tokens · 63110 ms · 2026-05-12T01:07:42.418810+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 18 internal anchors

  1. [1]

    arXiv preprint arXiv:2402.10571 , year=

    Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an off- set.ArXiv, abs/2402.10571, 2024. URL https://api.semanticscholar.org/CorpusID: 267740352

  2. [2]

    Matharena: Evaluating llms on uncontaminated math competitions, February 2025

    Mislav Balunovi ´c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi ´c, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  3. [3]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  5. [5]

    From atomic to composite: Reinforcement learning enables general- ization in complementary reasoning.arXiv preprint arXiv:2512.01970, 2025

    Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, and Victor Zhong. From atomic to composite: Reinforcement learning enables general- ization in complementary reasoning.arXiv preprint arXiv:2512.01970, 2025

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

  8. [8]

    Preprint, arXiv:2210.02410

    Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning.arXiv preprint arXiv:2210.02410, 2022

  9. [9]

    Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

  10. [10]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  11. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  12. [13]

    Targen: Targeted data genera- tion with large language models.arXiv preprint arXiv:2310.17876, 2023

    Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, and Swaroop Mishra. Targen: Targeted data genera- tion with large language models.arXiv preprint arXiv:2310.17876, 2023

  13. [14]

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yu- jie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun

    Andre He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening.arXiv preprint arXiv:2506.02355, 2025. 10

  14. [15]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.arXiv preprint arXiv:2402.14008, 2024

  15. [16]

    arXiv preprint arXiv:2504.11456 , year =

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025

  16. [17]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  17. [18]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, 2023

  18. [19]

    Math-verify: Robust mathematical expression evaluator

    Hugging Face. Math-verify: Robust mathematical expression evaluator. https://github. com/huggingface/Math-Verify, 2025. GitHub repository; version <vX.Y .Z>; accessed 2025-09-09

  19. [20]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  20. [21]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  21. [22]

    arXiv preprint arXiv:2510.13786 , year=

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms.arXiv preprint arXiv:2510.13786, 2025

  22. [23]

    Spoc: Search-based pseudocode to code.Advances in Neural Information Processing Systems, 32, 2019

    Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. Spoc: Search-based pseudocode to code.Advances in Neural Information Processing Systems, 32, 2019

  23. [24]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  24. [25]

    Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, 2024

    Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step- dpo: Step-wise preference optimization for long-chain reasoning of llms.arXiv preprint arXiv:2406.18629, 2024

  25. [26]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

  26. [27]

    The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

    J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

  27. [28]

    Small models struggle to learn from strong reasoners

    Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 25366–25394, ...

  28. [29]

    arXiv preprint arXiv:2507.01352 , year=

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352, 2025

  29. [30]

    Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  30. [31]

    Cross-task gener- alization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task gener- alization via natural language crowdsourcing instructions. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, 2022

  31. [32]

    Orca 2: Teaching small language models how to reason.arXiv preprint arXiv:2311.11045, 2023

    Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. Orca 2: Teaching small language models how to reason.arXiv preprint arXiv:2311.11045, 2023

  32. [33]

    Mid-training of large language models: A survey.arXiv preprint arXiv:2510.06826, 2025

    Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxi- ang Zeng. Mid-training of large language models: A survey.arXiv preprint arXiv:2510.06826, 2025

  33. [34]

    Mukherjee, A

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

  34. [35]

    AMC 12A (2023): Problems and Solutions

    Mathematical Association of America. AMC 12A (2023): Problems and Solutions. https: //artofproblemsolving.com/wiki/index.php/2023_AMC_12A_Problems, 2023. Com- petition held Nov 8, 2023. Problems © MAA AMC

  35. [36]

    Olmo 3

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025

  36. [37]

    New embedding models and api updates

    OpenAI. New embedding models and api updates. https://openai.com/index/ new-embedding-models-and-api-updates/, January 2024

  37. [38]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  38. [39]

    How many data samples is an additional instruction worth? InFindings of the Association for Computational Linguistics: EACL 2023, pages 1042–1057, 2023

    Ravsehaj Singh Puri, Swaroop Mishra, Mihir Parmar, and Chitta Baral. How many data samples is an additional instruction worth? InFindings of the Association for Computational Linguistics: EACL 2023, pages 1042–1057, 2023

  39. [40]

    Princeton science library

    George Pólya and John Horton Conway.How to solve it. Princeton science library. Princeton University Press, Princeton [N.J.], expanded princeton science library ed. edition, 2004. ISBN 0-691-11966-X ; 978-0-691-11966-3. Includes bibliographical references

  40. [41]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  41. [42]

    Thinktuning: Instilling cognitive reflections without distillation.arXiv preprint arXiv:2508.07616, 2025

    Aswin RRV , Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, and Ben Zhou. Thinktuning: Instilling cognitive reflections without distillation.arXiv preprint arXiv:2508.07616, 2025

  42. [43]

    Triple preference optimization: Achieving better alignment using a single step optimization.arXiv preprint arXiv:2405.16681, 2024

    Amir Saeidi, Shivanshu Verma, Aswin RRV , Kashif Rasul, and Chitta Baral. Triple preference optimization: Achieving better alignment using a single step optimization.arXiv preprint arXiv:2405.16681, 2024

  43. [44]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 12

  44. [45]

    Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold.Advances in Neural Information Processing Systems, 37:43000–43031, 2024

    Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold.Advances in Neural Information Processing Systems, 37:43000–43031, 2024

  45. [46]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Yankai Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  46. [47]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  47. [48]

    Musr: Testing the limits of chain-of-thought with multistep soft reasoning.arXiv:2310.16049,

    Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning.arXiv preprint arXiv:2310.16049, 2023

  48. [49]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

  49. [50]

    A survey on llm mid-training.arXiv preprint arXiv:2510.23081, 2025

    Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, and Xunliang Cai. A survey on llm mid-training.arXiv preprint arXiv:2510.23081, 2025

  50. [51]

    Trl: Transformer reinforce- ment learning.https://github.com/huggingface/trl, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforce- ment learning.https://github.com/huggingface/trl, 2020

  51. [52]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025

  52. [53]

    Self-instruct: Aligning language models with self-generated instruc- tions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  53. [54]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

  54. [55]

    Williams

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist re- inforcement learning.Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URLhttps://doi.org/10.1007/BF00992696

  55. [56]

    Wizardlm: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qing- wei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. InThe Twelfth International Conference on Learning Representations, 2024

  56. [57]

    Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025. URL https://arxiv. org/abs/2503.02951

  57. [58]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  58. [59]

    arXiv preprint arXiv:2509.25123 , year=

    Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, and Hao Peng. From f(x) and g(x) to f(g(x)) : Llms learn new skills in rl by composing old ones, 2025. URLhttps://arxiv.org/abs/2509.25123

  59. [60]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025

  60. [61]

    Star: Bootstrapping reasoning with reasoning.Advances in Neural Information Processing Systems, 35:15476–15488, 2022

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning.Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  61. [62]

    On the interplay of pre-training, mid-training, and rl on reasoning language models.arXiv preprint arXiv:2512.07783,

    Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models, 2025. URLhttps://arxiv.org/abs/2512.07783

  62. [63]

    American invitational mathematics examination (aime) 2024, 2024

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

  63. [64]

    American invitational mathematics examination (aime) 2025, 2025

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

  64. [65]

    Each sentence in the generated text uses a second person

    Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, et al. Breaking the exploration bottleneck: Rubric- scaffolded reinforcement learning for general llm reasoning.arXiv preprint arXiv:2508.16949, 2025. 14 Table of Contents A Appendix. . . . . . . . . . . . . . . . . . . . . . . . ....

  65. [66]

    Filter candidates by answer correctness via Math-Verify

  66. [67]

    Score remaining candidates with reward model

  67. [68]

    Decision: Yes

    Select response with highest score as final heuristic-specific solution Figure 8: Reward model scoring pipeline. After filtering for correctness, Skywork-Reward-V2 ranks candidate responses by heuristic adherence. The highest-scoring response per (question, heuristic) pair is selected for mid-training. A.7.2 Reinforcement Learning RL Training Data Format ...

  68. [69]

    determination hope success

    benchmark scale. These results further support the reliability of our automated classifier for large-scale heuristic annotation. Human Annotation Interface Instructions: For each sample, determine if the reasoning chain exhibits the described behavior. Click Yes or No.). Sample 1 / 40 Heuristic:Determination Hope Success Behavior Description: In your reas...

  69. [70]

    All the given conditions are satisfied, and the context makes sense

    Let’s verify: If she sold 48 in April and 24 in May (which is indeed half of 48), their sum is 72. All the given conditions are satisfied, and the context makes sense. The answer is correct. Can You Derive Result Differently Try to reach the same answer using another method or ap- proach or viewpoint. Comparing two ways helps confirm the answer and gives ...

  70. [71]

    Pappus First construct or create something that helps you explore the problem

    By clearly naming each quantity, the solution pathway is visible and straightforward. Pappus First construct or create something that helps you explore the problem. Draw a figure, build an example, or add a new element that reveals how the parts are related. Think of problem solving as a process of invention — by con- structing and observing, you can ofte...

  71. [72]

    First, find what seems to be the right answer or pattern by exploring ex- amples, calculating, or testing possibilities

    33 Table 7 – continued Heuristic Description Example Problems To Find And Prove Approach the problem in two stages. First, find what seems to be the right answer or pattern by exploring ex- amples, calculating, or testing possibilities. Then, once you have a likely result, prove it by explaining why it must be true in all cases, using logical reasoning or...

  72. [73]

    So, this possibility is inconsistent with the problem statement

    But the problem says May’s sales were half of April’s, not double. So, this possibility is inconsistent with the problem statement. Therefore, May’s number must truly be half of April’s: 48 / 2 = 24. Adding April and May gives 48 + 24 = 72 clips. Redundant When analyzing the problem, check whether any part of the information or condition repeats, adds not...

  73. [74]

    Now, adding both months together, 48 plus 24 gives 72

    At this point, I’m reminded, ’Measure twice, cut once.’ Before I add, let me double-check that math: half of 48 is indeed 24. Now, adding both months together, 48 plus 24 gives 72. Careful checking helps avoid mistakes, just like the proverb says. 36 Table 7 – continued Heuristic Description Example Working Backwards You must start from the goal itself. A...