pith. machine review for the scientific record.

arxiv: 2501.09686 · v3 · submitted 2025-01-16 · 💻 cs.AI · cs.CL

Recognition: no theorem link

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Pith reviewed 2026-05-15 21:17 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords large language models · reinforcement learning · reasoning · test-time scaling · large reasoning models · survey

The pith

Reinforcement learning on reasoning trajectories combined with test-time token scaling points toward Large Reasoning Models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how large language models can move beyond simple next-token prediction by treating sequences of intermediate thoughts as trainable reasoning processes. Reinforcement learning is applied to automatically discover and reinforce high-quality reasoning trajectories, which supplies far more training data than human annotation allows. At inference time, permitting the model to generate additional thinking tokens before answering further improves accuracy on hard tasks. Together these techniques define an emerging frontier the authors call Large Reasoning Models, with OpenAI's o1 series presented as the first prominent example. A reader cares because the approach reframes scaling laws around computation spent on thought rather than solely on model size or data volume.

Core claim

The central claim is that applying reinforcement learning to train models on automatically generated reasoning trajectories, paired with deliberate test-time scaling of thinking tokens, expands LLMs' reasoning capacity and marks a path to a new class of Large Reasoning Models.

What carries the argument

The reinforced reasoning paradigm, in which reinforcement learning optimizes the generation of thought sequences that represent intermediate reasoning steps through trial-and-error search.

If this is right

  • Automated trial-and-error search generates substantially more high-quality reasoning trajectories than manual curation, expanding available training data.
  • Allocating more tokens to thinking during inference measurably raises accuracy on complex tasks.
  • The combination of train-time RL and test-time scaling shifts the paradigm from pure autoregressive generation to learned search and reflection.
  • Open-source efforts are already constructing models that follow this reinforced-reasoning pattern.
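
The first bullet can be made concrete with a toy sketch. It assumes a rejection-sampling style loop, which is only one of several ways the surveyed work builds trajectories: sample candidate reasoning traces from a policy, check each final answer with an outcome verifier, and keep the verified ones as training data. The names `sample_trajectory`, `verify`, and `collect_trajectories`, the integer-addition task, and the `error_rate` parameter are all hypothetical stand-ins, not anything defined in the paper.

```python
import random


def sample_trajectory(a, b, rng, error_rate):
    """Hypothetical noisy 'policy': returns (steps, answer) for the task a + b."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-2, -1, 1, 2])  # inject an arithmetic slip
    steps = [f"start with {a}", f"add {b}", f"answer {answer}"]
    return steps, answer


def verify(a, b, answer):
    """Outcome verifier: an exact check, possible only because the toy task is checkable."""
    return answer == a + b


def collect_trajectories(problems, samples_per_problem, error_rate=0.4, seed=0):
    """Trial-and-error data construction: sample many traces, keep the verified ones."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for a, b in problems:
        for _ in range(samples_per_problem):
            steps, answer = sample_trajectory(a, b, rng, error_rate)
            if verify(a, b, answer):
                kept.append({"problem": (a, b), "steps": steps, "answer": answer})
    return kept


problems = [(2, 3), (10, 7), (41, 1)]
dataset = collect_trajectories(problems, samples_per_problem=8)
print(f"kept {len(dataset)} of {len(problems) * 8} sampled trajectories")
```

The filter is trustworthy here only because `verify` is exact. If a learned reward model stood in for it, the kept set would inherit that model's errors, which is the circularity concern the referee report raises.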

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaling holds, these models could extend to long-horizon planning problems in robotics or scientific discovery where intermediate verification is possible.
  • The method may reduce dependence on human-curated reasoning datasets, shifting data creation toward self-generated trajectories.
  • A practical test would measure whether the extra test-time tokens yield consistent gains across domains that lack clear reward signals.
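
One way to run such a test without any reward signal is sketched here. It substitutes majority voting over independently sampled answers (often called self-consistency) for longer thinking traces, so it is a proxy for test-time compute rather than the token-level scaling the paper describes. The simulated model `noisy_answer`, its `p_correct` rate, and the helper names are hypothetical.

```python
import random
from collections import Counter


def noisy_answer(truth, rng, p_correct=0.6):
    """Hypothetical single-sample model: correct with probability p_correct."""
    if rng.random() < p_correct:
        return truth
    return truth + rng.choice([-3, -2, -1, 1, 2, 3])  # scattered wrong answers


def majority_vote(truth, n_samples, rng):
    """Spend more inference compute: draw n_samples answers, return the most common."""
    votes = Counter(noisy_answer(truth, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples, n_problems=2000, seed=0):
    """Fraction of simulated problems where the voted answer matches the truth."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    truths = [rng.randint(0, 100) for _ in range(n_problems)]
    hits = sum(majority_vote(t, n_samples, rng) == t for t in truths)
    return hits / n_problems


for n in (1, 5, 25):
    print(f"samples={n:>2}  accuracy={accuracy(n):.3f}")
```

In this simulation the gain depends on wrong answers being scattered while correct ones agree. A domain where errors are systematic would not show the same curve, which is exactly what a cross-domain measurement would need to probe.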

Load-bearing premise

Reinforcement learning applied to reasoning trajectories will reliably expand LLMs' reasoning capacity without introducing systematic biases or hallucinations that are harder to detect than in standard generation.

What would settle it

An experiment demonstrating that RL-trained reasoning models produce no net accuracy gain or exhibit higher rates of subtle, hard-to-detect errors on held-out complex reasoning tasks compared with standard LLMs would falsify the claimed expansion of capacity.

Original abstract

Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to reveal a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects aimed at building large reasoning models, and conclude with open challenges and future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys recent progress in LLM reasoning, tracing the shift from autoregressive generation to 'thought' processes and 'learning to reason' via reinforcement learning. It argues that RL enables automatic generation of high-quality reasoning trajectories through trial-and-error, substantially expanding capacity via more training data; combined with test-time scaling (more tokens at inference), this forms a path to Large Reasoning Models, with OpenAI o1 as a milestone. The review covers LLM background, automated data construction, learning-to-reason techniques, test-time scaling, open-source projects, and open challenges.

Significance. A well-executed survey in this rapidly evolving area could help organize the literature on reinforced reasoning and highlight the train-time/test-time scaling paradigm, providing a useful reference for researchers working on more capable reasoning systems.

major comments (2)
  1. [Abstract] Abstract: the survey provides no explicit selection criteria for included papers or protocol for reconciling conflicting results across studies; this is load-bearing for a literature review's credibility and should be stated in the introduction or methods section.
  2. [learning-to-reason techniques] Section on learning-to-reason (automated data construction + RL): the repeated claim that RL produces 'high-quality' trajectories via trial-and-error assumes reward models avoid circular verification risks and systematic biases; the survey does not discuss independent verification mechanisms or empirical checks for hallucination amplification beyond the RL objective itself.
minor comments (2)
  1. [Introduction] Introduction: the term 'Large Reasoning Model' is introduced as a new frontier without a precise definition or distinguishing criteria from existing LLMs; a short clarifying paragraph would help.
  2. [Throughout] Throughout: some citations to recent arXiv preprints (e.g., o1-related work) would benefit from explicit dates or version numbers to aid readers tracking fast-moving developments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We will revise the manuscript to address the two major comments by adding explicit methodological details and expanding the discussion of limitations in the learning-to-reason section.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the survey provides no explicit selection criteria for included papers or protocol for reconciling conflicting results across studies; this is load-bearing for a literature review's credibility and should be stated in the introduction or methods section.

    Authors: We agree that an explicit description of paper selection and conflict reconciliation strengthens survey credibility. In the revised manuscript we will add a dedicated paragraph in the Introduction (new subsection 1.3) that states: (i) search strategy (arXiv, ACL Anthology, NeurIPS/ICLR/ICML proceedings 2023-2025, keywords “reinforcement learning reasoning LLM”, “o1”, “test-time scaling”); (ii) inclusion criteria (peer-reviewed or high-impact preprints with empirical results on reasoning benchmarks); (iii) exclusion of purely theoretical or non-LLM work; and (iv) our approach to conflicting results (presenting both positive and negative findings with citations and noting open questions rather than claiming consensus). revision: yes

  2. Referee: [learning-to-reason techniques] Section on learning-to-reason (automated data construction + RL): the repeated claim that RL produces 'high-quality' trajectories via trial-and-error assumes reward models avoid circular verification risks and systematic biases; the survey does not discuss independent verification mechanisms or empirical checks for hallucination amplification beyond the RL objective itself.

    Authors: We accept the observation that the current text under-emphasizes verification risks. We will revise Section 4 (Learning-to-Reason) by inserting a new subsection 4.4 “Verification and Bias Considerations” that (a) acknowledges the circularity risk when reward models are trained on the same distribution as the policy, (b) cites existing empirical checks (e.g., process-supervised reward models, outcome verification via external solvers, and human preference studies), and (c) discusses reported cases of hallucination amplification under RL. We will also tone down the unqualified “high-quality” phrasing in the abstract and introduction to “trajectories that improve downstream benchmark performance under the learned reward model.” revision: yes

Circularity Check

0 steps flagged

Survey reports external literature without internal derivation or self-referential reduction

full rationale

The paper is a literature survey reviewing progress in LLM reasoning, RL-based training, automated data construction, and test-time scaling. It presents no original equations, fitted parameters, or quantitative derivations of its own. All technical claims are attributed to cited external works (including OpenAI o1), with no load-bearing step that reduces by construction to the survey's own inputs or self-citations. The central narrative simply organizes existing results rather than deriving new ones from within the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

As a survey the paper rests on the assumption that the cited literature accurately represents the state of the field and that the three highlighted technical components (automated data construction, learning-to-reason, test-time scaling) are the dominant drivers.

axioms (1)
  • domain assumption: Reinforcement learning on reasoning trajectories produces higher-quality training data than human annotation alone.
    Invoked in the abstract when stating that RL enables automatic generation of high-quality reasoning trajectories.
invented entities (1)
  • Large Reasoning Model (no independent evidence)
    purpose: Conceptual category for LLMs trained with reinforced reasoning and test-time scaling.
    Introduced as the target of the surveyed research frontier.

pith-pipeline@v0.9.0 · 5643 in / 1183 out tokens · 14404 ms · 2026-05-15T21:17:24.834699+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

    cs.CR 2026-04 unverdicted novelty 8.0

    MirageBackdoor is the first backdoor attack that preserves clean chain-of-thought reasoning in LLMs while steering the final answer to a specific incorrect target under a trigger.

  3. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  4. IE as Cache: Information Extraction Enhanced Agentic Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.

  5. Metacognitive Behavioral Tuning of Large Language Models for Multi-Hop Question Answering

    cs.AI 2026-02 unverdicted novelty 7.0

    Metacognitive Behavioral Tuning injects a five-phase structure into LLM reasoning traces to improve accuracy-efficiency on multi-hop QA while reducing trace length and degeneration.

  6. ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.

  7. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  8. Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.

  9. Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs and VLMs exhibit grounding-dependent causal transfer in the OpenLock paradigm, requiring initial context-specific learning unlike humans who transfer structures from the first attempt.

  10. Reasoning Fails Where Step Flow Breaks

    cs.AI 2026-04 unverdicted novelty 6.0

    Step-Saliency identifies shallow lock-in and deep decay in reasoning model attention flows, and StepFlow intervention repairs them to improve accuracy across LRMs.

  11. Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

    cs.LG 2026-03 unverdicted novelty 6.0

    Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.

  12. How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

    cs.AI 2026-05 unverdicted novelty 5.0

    IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.

  13. How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

    cs.AI 2026-05 unverdicted novelty 5.0

    Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.

  14. UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...

  15. Efficient Test-Time Scaling via Temporal Reasoning Aggregation

    cs.AI 2026-04 unverdicted novelty 5.0

    TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.

  16. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  17. Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

    cs.LG 2026-05 unverdicted novelty 3.0

    Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.

  18. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

202 extracted references · 202 canonical work pages · cited by 18 Pith papers · 41 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, S´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 techni- cal report. arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  4. [4]

    Direct preference optimization with an offset

    Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571, 2024

  5. [5]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019

  6. [6]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  7. [7]

    Openai’s new o3 model freaks out computer science majors

    AXIOS. Openai’s new o3 model freaks out computer science majors. 2025

  8. [8]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  9. [9]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitu- tional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  10. [10]

    Phyre: A new benchmark for physical reasoning.Advances in Neural Information Processing Systems, 32, 2019

    Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new benchmark for physical reasoning.Advances in Neural Information Processing Systems, 32, 2019

  11. [11]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

  12. [12]

    Piqa: Reasoning about phys- ical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about phys- ical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  13. [13]

    Autonomous chemical research with large language models

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023

  14. [14]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877– 1901, 2020

  15. [15]

    A survey of monte carlo tree search methods

    Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowl- ing, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012

  16. [16]

    Alphamath almost zero: process supervision without process

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process. arXiv preprint arXiv:2405.03553, 2024

  17. [17]

    Step-level value preference optimiza- tion for mathematical reasoning

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimiza- tion for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024. 26

  18. [18]

    Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022

    Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022

  19. [19]

    Xing, and Liang Lin

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022

  20. [20]

    Large language model-driven meta-structure discovery in heterogeneous information network

    Lin Chen, Fengli Xu, Nian Li, Zhenyu Han, Meng Wang, Yong Li, and Pan Hui. Large language model-driven meta-structure discovery in heterogeneous information network. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing, pages 307–318, 2024

  21. [21]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  22. [22]

    Large language models meet harry potter: A dataset for aligning dialogue agents with characters

    Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, and Jia Li. Large language models meet harry potter: A dataset for aligning dialogue agents with characters. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8506–8520, 2023

  23. [23]

    Theoremqa: A theorem-driven question answering dataset, 2023

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset, 2023

  24. [24]

    Training verifiers to solve math word problems, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

  25. [25]

    Meta-in-context learning in large language models.Advances in Neural Information Process- ing Systems, 36, 2024

    Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. Meta-in-context learning in large language models.Advances in Neural Information Process- ing Systems, 36, 2024

  26. [26]

    Testing gpt-4-o1-preview on math and science problems: A follow-up study

    Ernest Davis. Testing gpt-4-o1-preview on math and science problems: A follow-up study. arXiv preprint arXiv:2410.22340, 2024

  27. [27]

    System 2 thinking in openai’s o1- preview model: Near-perfect performance on a mathematics exam

    Joost CF de Winter, Dimitra Dodou, and Yke Bauke Eisma. System 2 thinking in openai’s o1- preview model: Near-perfect performance on a mathematics exam. Computers, 13(11):278, 2024

  28. [28]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024

  29. [29]

    Reasonbert: Pre-trained to reason with distant supervision

    Xiang Deng, Yu Su, Alyssa Lees, You Wu, Cong Yu, and Huan Sun. Reasonbert: Pre-trained to reason with distant supervision. arXiv preprint arXiv:2109.04912, 2021

  30. [30]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understand- ing. arXiv preprint arXiv:1810.04805, 2018

  31. [31]

    Data augmentation using llms: Data perspectives, learning paradigms and challenges

    Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wen- han Xia, Junjie Hu, Anh Tuan Luu, and Shafiq Joty. Data augmentation using llms: Data perspectives, learning paradigms and challenges. arXiv preprint arXiv:2403.02990, 2024

  32. [32]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

  33. [33]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234 , 2022

  34. [34]

    Abduction

    Igor Douven. Abduction. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philoso- phy. Metaphysics Research Lab, Stanford University, Summer 2021 edition, 2021

  35. [35]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  36. [36]

    The path to superintelligence: A critical analysis of openai’s five levels of ai progression

    Tom Duenas and Diana Ruiz. The path to superintelligence: A critical analysis of openai’s five levels of ai progression. ResearchGate, 2024b. doi, 10, 2024. 27

  37. [37]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024

  38. [38]

    Promptbreeder: Self-referential self-improvement via prompt evolution

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rockt¨aschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

  39. [39]

    Memory sharing for large language model based agents

    Hang Gao and Yongfeng Zhang. Memory sharing for large language model based agents. arXiv preprint arXiv:2404.09982, 2024

  40. [40]

    Self-evolving gpt: A lifelong autonomous experiential learner

    Jinglong Gao, Xiao Ding, Yiming Cui, Jianbai Zhao, Hepeng Wang, Ting Liu, and Bing Qin. Self-evolving gpt: A lifelong autonomous experiential learner. arXiv preprint arXiv:2407.08937, 2024

  41. [41]

    Rlef: Grounding code llms in execution feedback with reinforcement learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Syn- naeve. Rlef: Grounding code llms in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024

  42. [42]

    Llms accelerate annotation for medical information extraction

    Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xi- aohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, et al. Llms accelerate annotation for medical information extraction. In Machine Learning for Health (ML4H), pages 82–100. PMLR, 2023

  43. [43]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

  44. [44]

    Richelieu: Self-evolving llm-based agents for ai diplomacy

    Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, and Yizhou Wang. Richelieu: Self-evolving llm-based agents for ai diplomacy. arXiv preprint arXiv:2407.06813, 2024

  45. [45]

    Reinforced Self-Training (ReST) for Language Modeling

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Rein- forced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023

  46. [46]

    Fabbri, Wo- jciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, and Dragomir Radev

    Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wo- jciech Kr...

  47. [47]

    Inductive Logic

    James Hawthorne. Inductive Logic. In Edward N. Zalta and Uri Nodelman, editors, The Stan- ford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2024 edition, 2024

  48. [48]

    A cross-domain performance report of open ai chatgpt o1 model

    Kadhim Hayawi and Sakib Shahriar. A cross-domain performance report of open ai chatgpt o1 model. 2024

  49. [49]

    Measuring Coding Challenge Competence With APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021

  50. [50]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021

  51. [51]

    Learning to solve arithmetic word problems with verb categorization

    Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October 2014. Association for Computational Linguistics

  53. [53]

    Can gpt-o1 kill all bugs? an evaluation of gpt-family llms on quixbugs

    Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, and Quanjun Zhang. Can gpt-o1 kill all bugs? an evaluation of gpt-family llms on quixbugs. arXiv e-prints, pages arXiv–2409, 2024

  54. [54]

    Automated Design of Agentic Systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024

  55. [55]

    Agents’ room: Narrative generation through multi-step collaboration

    Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, and Mirella Lapata. Agents’ room: Narrative generation through multi-step collaboration. arXiv preprint arXiv:2410.02603, 2024

  56. [56]

    Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards

    Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards. arXiv preprint arXiv:2404.10346, 2024

  57. [57]

    blob reinforcement fine-tuning

    interconnects.ai. blob reinforcement fine-tuning. (Accessed: 2025-12-6)

  58. [58]

    Lisa: Language models of Isabelle proofs

    Albert Q. Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. Lisa: Language models of Isabelle proofs. 6th Conference on Artificial Intelligence and Theorem Proving, 2021

  59. [59]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  60. [60]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  61. [61]

    Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679, 2024

  62. [62]

    Meganno+: A human-llm collaborative annotation system

    Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, and Dan Zhang. Meganno+: A human-llm collaborative annotation system. arXiv preprint arXiv:2402.18050, 2024

  63. [63]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

  64. [64]

    Probing physical reasoning with counter-commonsense context

    Kazushi Kondo, Saku Sugawara, and Akiko Aizawa. Probing physical reasoning with counter-commonsense context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 603–612, 2023

  65. [65]

    Training language models to self-correct via reinforcement learning

    Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024

  66. [66]

    Language models as zero-shot trajectory generators

    Teyun Kwon, Norman Di Palo, and Edward Johns. Language models as zero-shot trajectory generators. IEEE Robotics and Automation Letters, 2024

  67. [67]

    Llms as factual reasoners: Insights from existing benchmarks and beyond

    Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R Fabbri, Caiming Xiong, Shafiq Joty, and Chien-Sheng Wu. Llms as factual reasoners: Insights from existing benchmarks and beyond. arXiv preprint arXiv:2305.14540, 2023

  68. [68]

    Ds-1000: A natural and reliable benchmark for data science code generation

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023

  69. [69]

    A systematic assessment of openai o1-preview for higher order thinking in education

    Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nayaaba, Gyeonggeon Lee, Liang Zhang, Arne Bewersdorff, Luyang Fang, et al. A systematic assessment of openai o1-preview for higher order thinking in education. arXiv preprint arXiv:2410.21287, 2024

  70. [70]

    Multi-agent causal discovery using large language models

    Hao Duong Le, Xin Xia, and Zhang Chen. Multi-agent causal discovery using large language models. arXiv preprint arXiv:2407.15073, 2024

  71. [71]

    Llm2llm: Boosting llms with novel iterative data enhancement

    Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042, 2024

  72. [72]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    M Lewis. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

  73. [73]

    Dotamath: Decomposition of thought with code assistance and self-correction for mathematical reasoning

    Chengpeng Li, Guanting Dong, Mingfeng Xue, Ru Peng, Xiang Wang, and Dayiheng Liu. Dotamath: Decomposition of thought with code assistance and self-correction for mathematical reasoning. arXiv preprint arXiv:2407.04078, 2024

  74. [74]

    Openai-o1 ab testing: Does the o1 model really do good reasoning in math problem solving?

    Leo Li, Ye Luo, and Tingyou Pan. Openai-o1 ab testing: Does the o1 model really do good reasoning in math problem solving? arXiv preprint arXiv:2411.06198, 2024

  75. [75]

    Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation

    Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F Chen, Zhengyuan Liu, and Diyi Yang. Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. arXiv preprint arXiv:2310.15638, 2023

  76. [76]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023

  77. [77]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023

  78. [78]

    Wanli: Worker and AI collaboration for natural language inference dataset creation

    Alisa Liu, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. Wanli: Worker and AI collaboration for natural language inference dataset creation, 2022

  79. [79]

    Fimo: A challenge formal dataset for automated theorem proving

    Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. Fimo: A challenge formal dataset for automated theorem proving, 2023

  80. [80]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023

Showing first 80 references.