arxiv: 2501.09686 · v3 · submitted 2025-01-16 · 💻 cs.AI · cs.CL

Recognition: no theorem link

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li

Pith reviewed 2026-05-15 21:17 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords large language models · reinforcement learning · reasoning · test-time scaling · large reasoning models · survey

The pith

Reinforcement learning on reasoning trajectories combined with test-time token scaling points toward Large Reasoning Models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how large language models can move beyond simple next-token prediction by treating sequences of intermediate thoughts as trainable reasoning processes. Reinforcement learning is applied to automatically discover and reinforce high-quality reasoning trajectories, which supplies far more training data than human annotation allows. At inference time, permitting the model to generate additional thinking tokens before answering further improves accuracy on hard tasks. Together these techniques define an emerging frontier the authors call Large Reasoning Models, with OpenAI's o1 series presented as the first prominent example. A reader cares because the approach reframes scaling laws around computation spent on thought rather than solely on model size or data volume.

Core claim

The central claim is that applying reinforcement learning to train models on automatically generated reasoning trajectories, paired with deliberate test-time scaling of thinking tokens, expands LLMs' reasoning capacity and marks a path to a new class of Large Reasoning Models.

What carries the argument

The reinforced reasoning paradigm, in which reinforcement learning optimizes the generation of thought sequences that represent intermediate reasoning steps through trial-and-error search.
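
As a rough illustration of trial-and-error search over thought sequences, the sketch below samples random step sequences on a toy arithmetic task and keeps only those whose final answer can be checked as correct. The task, the step names (`add3`, `double`), and the 0/1 outcome reward are all assumptions made up for this example; the survey covers far richer search and reward schemes, and this is not the authors' method.

```python
import random

# Hypothetical toy task: reach `target` from `start` using only these steps.
STEPS = {"add3": lambda x: x + 3, "double": lambda x: x * 2}

def sample_trajectory(start, max_len, rng):
    """Randomly propose a sequence of intermediate steps (a toy 'thought')."""
    value, steps = start, []
    for _ in range(rng.randint(1, max_len)):
        name = rng.choice(sorted(STEPS))
        value = STEPS[name](value)
        steps.append(name)
    return steps, value

def reward(final_value, target):
    """Outcome reward (assumed): 1.0 only when the final answer is correct."""
    return 1.0 if final_value == target else 0.0

def collect_trajectories(start, target, n_trials, max_len=4, seed=0):
    """Trial-and-error search: keep the distinct trajectories that earn reward 1."""
    rng = random.Random(seed)
    kept = set()
    for _ in range(n_trials):
        steps, value = sample_trajectory(start, max_len, rng)
        if reward(value, target) == 1.0:
            kept.add(tuple(steps))
    return sorted(kept)

def replay(start, steps):
    """Re-run a stored trajectory to confirm the answer it reaches."""
    value = start
    for name in steps:
        value = STEPS[name](value)
    return value

if __name__ == "__main__":
    for steps in collect_trajectories(start=1, target=8, n_trials=500):
        print(" -> ".join(steps), "=", replay(1, steps))
```

In a real pipeline the kept trajectories would feed a policy update; here they are only collected, which is enough to show why a checkable reward lets the search produce training data without human annotation.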

If this is right

Automated trial-and-error search generates substantially more high-quality reasoning trajectories than manual curation, expanding available training data.
Allocating more tokens to thinking during inference measurably raises accuracy on complex tasks.
The combination of train-time RL and test-time scaling shifts the paradigm from pure autoregressive generation to learned search and reflection.
Open-source efforts are already constructing models that follow this reinforced-reasoning pattern.
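
One simple way to see why extra inference compute can help is majority voting over several independently sampled answers (often called self-consistency). The sketch below swaps that technique in for illustration; it spends compute on more samples rather than on longer thinking within a single chain, and it is not necessarily the mechanism used by o1. The `noisy_solver` function is a hypothetical stand-in for a model, and the 60% per-sample accuracy is an assumed parameter.

```python
import random
from collections import Counter

def noisy_solver(correct_answer, p_correct, rng):
    """Stand-in for one sampled reasoning chain (hypothetical, not a real model)."""
    if rng.random() < p_correct:
        return correct_answer
    wrong = [a for a in range(10) if a != correct_answer]
    return rng.choice(wrong)

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def accuracy_at_budget(n_samples, n_problems=400, p_correct=0.6, seed=0):
    """Fraction of simulated problems solved when voting over n_samples chains."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_problems):
        truth = rng.randrange(10)
        votes = [noisy_solver(truth, p_correct, rng) for _ in range(n_samples)]
        hits += majority_vote(votes) == truth
    return hits / n_problems

if __name__ == "__main__":
    for budget in (1, 5, 15):
        print(budget, accuracy_at_budget(budget))
```

The simulation assumes wrong answers are spread over many alternatives, so that errors rarely agree with each other. When a model's errors are correlated, voting helps much less.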

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the scaling holds, these models could extend to long-horizon planning problems in robotics or scientific discovery where intermediate verification is possible.
The method may reduce dependence on human-curated reasoning datasets, shifting data creation toward self-generated trajectories.
A practical test would measure whether the extra test-time tokens yield consistent gains across domains that lack clear reward signals.
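
The practical test mentioned above could be tabulated roughly as follows: group evaluation records by domain and token budget, then compare accuracy at the largest and smallest budget within each domain. The record layout `(domain, token_budget, is_correct)` and the function name are assumptions for this sketch, and the records in the usage example are placeholders, not measurements.

```python
from collections import defaultdict

def gain_by_domain(records):
    """records: iterable of (domain, token_budget, is_correct) tuples (assumed layout).

    Returns, per domain, accuracy at the largest budget minus accuracy at the smallest.
    """
    totals = defaultdict(lambda: [0, 0])  # (domain, budget) -> [n_correct, n_total]
    for domain, budget, ok in records:
        cell = totals[(domain, budget)]
        cell[0] += int(ok)
        cell[1] += 1

    by_domain = defaultdict(dict)  # domain -> {budget: accuracy}
    for (domain, budget), (correct, count) in totals.items():
        by_domain[domain][budget] = correct / count

    return {d: acc[max(acc)] - acc[min(acc)] for d, acc in by_domain.items()}

if __name__ == "__main__":
    # Placeholder records for illustration only.
    demo = [
        ("math", 256, False), ("math", 256, True),
        ("math", 4096, True), ("math", 4096, True),
        ("essay", 256, True), ("essay", 256, False),
        ("essay", 4096, True), ("essay", 4096, False),
    ]
    print(gain_by_domain(demo))
```

A gain near zero in a domain without a clear reward signal, alongside a positive gain in a verifiable domain, would be the pattern this editorial extension is asking about.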

Load-bearing premise

Reinforcement learning applied to reasoning trajectories will reliably expand LLMs' reasoning capacity without introducing systematic biases or hallucinations that are harder to detect than in standard generation.

What would settle it

An experiment demonstrating that RL-trained reasoning models produce no net accuracy gain or exhibit higher rates of subtle, hard-to-detect errors on held-out complex reasoning tasks compared with standard LLMs would falsify the claimed expansion of capacity.
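
For the "no net accuracy gain" half of this criterion, one possible statistic is an exact McNemar test on paired per-item results from the two models. This is a swapped-in choice for illustration, not something the paper or the review prescribes, and it says nothing about the harder question of detecting subtle errors. The function name and input layout are assumptions.

```python
from math import comb

def mcnemar_exact(baseline_correct, rl_correct):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (gained, lost, p_value), where `gained` counts items only the
    RL-trained model solved and `lost` counts items only the baseline solved.
    """
    if len(baseline_correct) != len(rl_correct):
        raise ValueError("both models must be scored on the same items")
    pairs = list(zip(baseline_correct, rl_correct))
    gained = sum(1 for b, r in pairs if not b and r)
    lost = sum(1 for b, r in pairs if b and not r)
    n = gained + lost
    if n == 0:
        return gained, lost, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(gained, lost) + 1)) / 2 ** n
    return gained, lost, min(1.0, 2 * tail)

if __name__ == "__main__":
    baseline = [False] * 10 + [True] * 5
    rl_model = [True] * 15
    print(mcnemar_exact(baseline, rl_model))
```

A large p-value on a sufficiently large held-out set would be consistent with no net gain; it would not by itself establish equivalence.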

Original abstract

Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to show a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys recent progress in LLM reasoning, tracing the shift from autoregressive generation to 'thought' processes and 'learning to reason' via reinforcement learning. It argues that RL enables automatic generation of high-quality reasoning trajectories through trial-and-error, substantially expanding capacity via more training data; combined with test-time scaling (more tokens at inference), this forms a path to Large Reasoning Models, with OpenAI o1 as a milestone. The review covers LLM background, automated data construction, learning-to-reason techniques, test-time scaling, open-source projects, and open challenges.

Significance. A well-executed survey in this rapidly evolving area could help organize the literature on reinforced reasoning and highlight the train-time/test-time scaling paradigm, providing a useful reference for researchers working on more capable reasoning systems.

major comments (2)

[Abstract] The survey provides no explicit selection criteria for included papers or protocol for reconciling conflicting results across studies; this is load-bearing for a literature review's credibility and should be stated in the introduction or methods section.
[learning-to-reason techniques] Section on learning-to-reason (automated data construction + RL): the repeated claim that RL produces 'high-quality' trajectories via trial-and-error assumes reward models avoid circular verification risks and systematic biases; the survey does not discuss independent verification mechanisms or empirical checks for hallucination amplification beyond the RL objective itself.

minor comments (2)

[Introduction] The term 'Large Reasoning Model' is introduced as a new frontier without a precise definition or distinguishing criteria from existing LLMs; a short clarifying paragraph would help.
[Throughout] Some citations to recent arXiv preprints (e.g., o1-related work) would benefit from explicit dates or version numbers to aid readers tracking fast-moving developments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We will revise the manuscript to address the two major comments by adding explicit methodological details and expanding the discussion of limitations in the learning-to-reason section.

Referee: [Abstract] The survey provides no explicit selection criteria for included papers or protocol for reconciling conflicting results across studies; this is load-bearing for a literature review's credibility and should be stated in the introduction or methods section.

Authors: We agree that an explicit description of paper selection and conflict reconciliation strengthens survey credibility. In the revised manuscript we will add a dedicated paragraph in the Introduction (new subsection 1.3) that states: (i) search strategy (arXiv, ACL Anthology, NeurIPS/ICLR/ICML proceedings 2023-2025, keywords “reinforcement learning reasoning LLM”, “o1”, “test-time scaling”); (ii) inclusion criteria (peer-reviewed or high-impact preprints with empirical results on reasoning benchmarks); (iii) exclusion of purely theoretical or non-LLM work; and (iv) our approach to conflicting results (presenting both positive and negative findings with citations and noting open questions rather than claiming consensus).

Revision: yes

Referee: [learning-to-reason techniques] Section on learning-to-reason (automated data construction + RL): the repeated claim that RL produces 'high-quality' trajectories via trial-and-error assumes reward models avoid circular verification risks and systematic biases; the survey does not discuss independent verification mechanisms or empirical checks for hallucination amplification beyond the RL objective itself.

Authors: We accept the observation that the current text under-emphasizes verification risks. We will revise Section 4 (Learning-to-Reason) by inserting a new subsection 4.4 “Verification and Bias Considerations” that (a) acknowledges the circularity risk when reward models are trained on the same distribution as the policy, (b) cites existing empirical checks (e.g., process-supervised reward models, outcome verification via external solvers, and human preference studies), and (c) discusses reported cases of hallucination amplification under RL. We will also tone down the unqualified “high-quality” phrasing in the abstract and introduction to “trajectories that improve downstream benchmark performance under the learned reward model.”

Revision: yes
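
To make "outcome verification via external solvers" concrete, the sketch below checks claimed answers to arithmetic expressions with a small evaluator that is independent of any learned reward model, then flags cases the reward model scored highly but the checker rejects. The sample layout, the `reward_score` field, and the 0.5 threshold are hypothetical choices for illustration; the evaluator handles only `+`, `-`, `*`, `/` and stands in for whatever external solver a real pipeline would use.

```python
import ast
import operator

# Supported binary operators for the toy external checker.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def evaluate(expression):
    """Evaluate +, -, *, / arithmetic by walking the syntax tree (no eval())."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

def externally_verified(expression, claimed_answer, tol=1e-9):
    """True when the claimed answer matches the independent evaluation."""
    return abs(evaluate(expression) - claimed_answer) <= tol

def flag_reward_disagreements(samples, threshold=0.5):
    """Samples a (hypothetical) reward model scored highly but the checker rejects."""
    return [
        s for s in samples
        if s["reward_score"] >= threshold
        and not externally_verified(s["expression"], s["claimed_answer"])
    ]

if __name__ == "__main__":
    samples = [
        {"expression": "2*(3+4)", "claimed_answer": 14, "reward_score": 0.9},
        {"expression": "10/4", "claimed_answer": 3, "reward_score": 0.8},
        {"expression": "7-9", "claimed_answer": 5, "reward_score": 0.1},
    ]
    for s in flag_reward_disagreements(samples):
        print("reward model and checker disagree on:", s["expression"])
```

The flagged set is the interesting one for the circularity concern: it contains outputs that would be reinforced under the learned reward even though an independent check says they are wrong.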

Circularity Check

0 steps flagged

Survey reports external literature without internal derivation or self-referential reduction

The paper is a literature survey reviewing progress in LLM reasoning, RL-based training, automated data construction, and test-time scaling. It presents no original equations, fitted parameters, or quantitative derivations of its own. All technical claims are attributed to cited external works (including OpenAI o1), with no load-bearing step that reduces by construction to the survey's own inputs or self-citations. The central narrative simply organizes existing results rather than deriving new ones from within the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

As a survey the paper rests on the assumption that the cited literature accurately represents the state of the field and that the three highlighted technical components (automated data construction, learning-to-reason, test-time scaling) are the dominant drivers.

axioms (1)

Domain assumption: Reinforcement learning on reasoning trajectories produces higher-quality training data than human annotation alone.
Invoked in the abstract when stating that RL enables automatic generation of high-quality reasoning trajectories.

invented entities (1)

Large Reasoning Model (no independent evidence)
Purpose: Conceptual category for LLMs trained with reinforced reasoning and test-time scaling.
Introduced as the target of the surveyed research frontier.

pith-pipeline@v0.9.0 · 5643 in / 1183 out tokens · 14404 ms · 2026-05-15T21:17:24.834699+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
cs.CR 2026-04 unverdicted novelty 8.0

MirageBackdoor is the first backdoor attack that preserves clean chain-of-thought reasoning in LLMs while steering the final answer to a specific incorrect target under a trigger.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
IE as Cache: Information Extraction Enhanced Agentic Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
Metacognitive Behavioral Tuning of Large Language Models for Multi-Hop Question Answering
cs.AI 2026-02 unverdicted novelty 7.0

Metacognitive Behavioral Tuning injects a five-phase structure into LLM reasoning traces to improve accuracy-efficiency on multi-hop QA while reducing trace length and degeneration.
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment
cs.LG 2026-04 unverdicted novelty 6.0

Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.
Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer
cs.AI 2026-04 unverdicted novelty 6.0

LLMs and VLMs exhibit grounding-dependent causal transfer in the OpenLock paradigm, requiring initial context-specific learning unlike humans who transfer structures from the first attempt.
Reasoning Fails Where Step Flow Breaks
cs.AI 2026-04 unverdicted novelty 6.0

Step-Saliency identifies shallow lock-in and deep decay in reasoning model attention flows, and StepFlow intervention repairs them to improve accuracy across LRMs.
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
cs.LG 2026-03 unverdicted novelty 6.0

Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
cs.AI 2026-05 unverdicted novelty 5.0

IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
cs.AI 2026-05 unverdicted novelty 5.0

Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
cs.CV 2026-04 unverdicted novelty 5.0

UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...
Efficient Test-Time Scaling via Temporal Reasoning Aggregation
cs.AI 2026-04 unverdicted novelty 5.0

TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL 2025-03 accept novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
cs.LG 2026-05 unverdicted novelty 3.0

Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

202 extracted references · 202 canonical work pages · cited by 18 Pith papers · 41 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, S´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 techni- cal report. arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Direct preference optimization with an offset

Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571, 2024

work page arXiv 2024
[5]

Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019

Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019

work page 2019
[6]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Openai’s new o3 model freaks out computer science majors

AXIOS. Openai’s new o3 model freaks out computer science majors. 2025

work page 2025
[8]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitu- tional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Phyre: A new benchmark for physical reasoning.Advances in Neural Information Processing Systems, 32, 2019

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new benchmark for physical reasoning.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[11]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

work page 2024
[12]

Piqa: Reasoning about phys- ical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about phys- ical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

work page 2020
[13]

Autonomous chemical research with large language models

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023

work page 2023
[14]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877– 1901, 2020

work page 1901
[15]

A survey of monte carlo tree search methods

Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowl- ing, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012

work page 2012
[16]

Alphamath almost zero: process supervision without process

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process. arXiv preprint arXiv:2405.03553, 2024

work page arXiv 2024
[17]

Step-level value preference optimiza- tion for mathematical reasoning

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimiza- tion for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024. 26

work page arXiv 2024
[18]

Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022

Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022

work page 2022
[19]

Xing, and Liang Lin

Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022

work page 2022
[20]

Large language model-driven meta-structure discovery in heterogeneous information network

Lin Chen, Fengli Xu, Nian Li, Zhenyu Han, Meng Wang, Yong Li, and Pan Hui. Large language model-driven meta-structure discovery in heterogeneous information network. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing, pages 307–318, 2024

work page 2024
[21]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Large language models meet harry potter: A dataset for aligning dialogue agents with characters

Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, and Jia Li. Large language models meet harry potter: A dataset for aligning dialogue agents with characters. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8506–8520, 2023

work page 2023
[23]

Theoremqa: A theorem-driven question answering dataset, 2023

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset, 2023

work page 2023
[24]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

work page 2021
[25]

Meta-in-context learning in large language models.Advances in Neural Information Process- ing Systems, 36, 2024

Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. Meta-in-context learning in large language models.Advances in Neural Information Process- ing Systems, 36, 2024

work page 2024
[26]

Testing gpt-4-o1-preview on math and science problems: A follow-up study

Ernest Davis. Testing gpt-4-o1-preview on math and science problems: A follow-up study. arXiv preprint arXiv:2410.22340, 2024

work page arXiv 2024
[27]

System 2 thinking in openai’s o1- preview model: Near-perfect performance on a mathematics exam

Joost CF de Winter, Dimitra Dodou, and Yke Bauke Eisma. System 2 thinking in openai’s o1- preview model: Near-perfect performance on a mathematics exam. Computers, 13(11):278, 2024

work page 2024
[28]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[29]

Reasonbert: Pre-trained to reason with distant supervision

Xiang Deng, Yu Su, Alyssa Lees, You Wu, Cong Yu, and Huan Sun. Reasonbert: Pre-trained to reason with distant supervision. arXiv preprint arXiv:2109.04912, 2021

work page arXiv 2021
[30]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understand- ing. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[31]

Data augmentation using llms: Data perspectives, learning paradigms and challenges

Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wen- han Xia, Junjie Hu, Anh Tuan Luu, and Shafiq Joty. Data augmentation using llms: Data perspectives, learning paradigms and challenges. arXiv preprint arXiv:2403.02990, 2024

work page arXiv 2024
[32]

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

A Survey on In-context Learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Abduction

Igor Douven. Abduction. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philoso- phy. Metaphysics Research Lab, Stanford University, Summer 2021 edition, 2021

work page 2021
[35]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

The path to superintelligence: A critical analysis of openai’s five levels of ai progression

Tom Duenas and Diana Ruiz. The path to superintelligence: A critical analysis of openai’s five levels of ai progression. ResearchGate, 2024b. doi, 10, 2024. 27

work page 2024
[37]

Detecting hallucinations in large language models using semantic entropy

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024

work page 2024
[38]

Promptbreeder: Self-referential self-improvement via prompt evolution

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rockt¨aschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

work page arXiv 2023
[39]

Memory sharing for large language model based agents

Hang Gao and Yongfeng Zhang. Memory sharing for large language model based agents. arXiv preprint arXiv:2404.09982, 2024

work page arXiv 2024
[40]

Self-evolving gpt: A lifelong autonomous experiential learner

Jinglong Gao, Xiao Ding, Yiming Cui, Jianbai Zhao, Hepeng Wang, Ting Liu, and Bing Qin. Self-evolving gpt: A lifelong autonomous experiential learner. arXiv preprint arXiv:2407.08937, 2024

work page arXiv 2024
[41]

Rlef: Grounding code llms in execution feedback with reinforcement learning

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Syn- naeve. Rlef: Grounding code llms in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024

work page arXiv 2024
[42]

Llms accelerate annotation for medical information extraction

Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xi- aohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, et al. Llms accelerate annotation for medical information extraction. In Machine Learning for Health (ML4H), pages 82–100. PMLR, 2023

work page 2023
[43]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Richelieu: Self-evolving llm-based agents for ai diplomacy

Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, and Yizhou Wang. Richelieu: Self-evolving llm-based agents for ai diplomacy. arXiv preprint arXiv:2407.06813, 2024

work page arXiv 2024
[45]

Reinforced Self-Training (ReST) for Language Modeling

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Rein- forced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Fabbri, Wo- jciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, and Dragomir Radev

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wo- jciech Kr...

[47]

Inductive Logic

James Hawthorne. Inductive Logic. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2024 edition, 2024

[48]

A cross-domain performance report of open ai chatgpt o1 model

Kadhim Hayawi and Sakib Shahriar. A cross-domain performance report of open ai chatgpt o1 model. 2024

[49]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021

[50]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021

[51]

Learning to solve arithmetic word problems with verb categorization

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October 2014. Association for Computational Linguistics

[53]

Can gpt-o1 kill all bugs? an evaluation of gpt-family llms on quixbugs

Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, and Quanjun Zhang. Can gpt-o1 kill all bugs? an evaluation of gpt-family llms on quixbugs. arXiv e-prints, pages arXiv–2409, 2024

[54]

Automated Design of Agentic Systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024

[55]

Agents’ room: Narrative generation through multi-step collaboration

Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, and Mirella Lapata. Agents’ room: Narrative generation through multi-step collaboration. arXiv preprint arXiv:2410.02603, 2024

[56]

Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards

Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards. arXiv preprint arXiv:2404.10346, 2024

[57]

blob reinforcement fine-tuning

interconnects.ai. blob reinforcement fine-tuning. (Accessed: 2025-12-6)

[58]

Lisa: Language models of isabelle proofs

Albert Q. Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. Lisa: Language models of isabelle proofs. 6th Conference on Artificial Intelligence and Theorem Proving, 2021

[59]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

[60]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[61]

Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679, 2024

[62]

Meganno+: A human-llm collaborative annotation system

Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, and Dan Zhang. Meganno+: A human-llm collaborative annotation system. arXiv preprint arXiv:2402.18050, 2024

[63]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

[64]

Probing physical reasoning with counter-commonsense context

Kazushi Kondo, Saku Sugawara, and Akiko Aizawa. Probing physical reasoning with counter-commonsense context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 603–612, 2023

[65]

Training language models to self-correct via reinforcement learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024

[66]

Language models as zero-shot trajectory generators

Teyun Kwon, Norman Di Palo, and Edward Johns. Language models as zero-shot trajectory generators. IEEE Robotics and Automation Letters, 2024

[67]

Llms as factual reasoners: Insights from existing benchmarks and beyond

Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R Fabbri, Caiming Xiong, Shafiq Joty, and Chien-Sheng Wu. Llms as factual reasoners: Insights from existing benchmarks and beyond. arXiv preprint arXiv:2305.14540, 2023

[68]

Ds-1000: A natural and reliable benchmark for data science code generation

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023

[69]

A systematic assessment of openai o1-preview for higher order thinking in education

Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nayaaba, Gyeonggeon Lee, Liang Zhang, Arne Bewersdorff, Luyang Fang, et al. A systematic assessment of openai o1-preview for higher order thinking in education. arXiv preprint arXiv:2410.21287, 2024

[70]

Multi-agent causal discovery using large language models

Hao Duong Le, Xin Xia, and Zhang Chen. Multi-agent causal discovery using large language models. arXiv preprint arXiv:2407.15073, 2024

[71]

Llm2llm: Boosting llms with novel iterative data enhancement

Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042, 2024

[72]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

M Lewis. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

[73]

Dotamath: Decomposition of thought with code assistance and self-correction for mathematical reasoning

Chengpeng Li, Guanting Dong, Mingfeng Xue, Ru Peng, Xiang Wang, and Dayiheng Liu. Dotamath: Decomposition of thought with code assistance and self-correction for mathematical reasoning. arXiv preprint arXiv:2407.04078, 2024

[74]

Openai-o1 ab testing: Does the o1 model really do good reasoning in math problem solving?

Leo Li, Ye Luo, and Tingyou Pan. Openai-o1 ab testing: Does the o1 model really do good reasoning in math problem solving? arXiv preprint arXiv:2411.06198, 2024

[75]

Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation

Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F Chen, Zhengyuan Liu, and Diyi Yang. Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. arXiv preprint arXiv:2310.15638, 2023

[76]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023

[77]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023

[78]

Wanli: Worker and ai collaboration for natural language inference dataset creation

Alisa Liu, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. Wanli: Worker and ai collaboration for natural language inference dataset creation, 2022

[79]

Fimo: A challenge formal dataset for automated theorem proving

Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. Fimo: A challenge formal dataset for automated theorem proving, 2023

[80]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023


Showing first 80 references.