Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
Pith reviewed 2026-05-15 21:17 UTC · model grok-4.3
The pith
Reinforcement learning on reasoning trajectories combined with test-time token scaling points toward Large Reasoning Models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that applying reinforcement learning to train models on automatically generated reasoning trajectories, paired with deliberate test-time scaling of thinking tokens, expands LLMs' reasoning capacity and marks a path to a new class of Large Reasoning Models.
What carries the argument
The reinforced reasoning paradigm, in which reinforcement learning uses trial-and-error search to optimize the generation of thought sequences that represent intermediate reasoning steps.
If this is right
- Automated trial-and-error search generates substantially more high-quality reasoning trajectories than manual curation, expanding available training data.
- Allocating more tokens to thinking during inference measurably raises accuracy on complex tasks.
- The combination of train-time RL and test-time scaling shifts the paradigm from pure autoregressive generation to learned search and reflection.
- Open-source efforts are already constructing models that follow this reinforced-reasoning pattern.
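The test-time side of the claim reduces, in its simplest form, to best-of-N sampling: spend more inference compute drawing reasoning rollouts and keep the first one a cheap verifier accepts. The toy generator, its 60% error rate, and the exact-check verifier below are assumptions for illustration; in domains without an exact check, the verifier is replaced by majority voting or a learned scorer.

```python
import random

random.seed(1)

def generate_answer(question):
    """Stand-in for one stochastic reasoning rollout answering 'a + b'.
    Wrong with probability 0.6, like a noisy model."""
    a, b = question
    guess = a + b
    if random.random() < 0.6:           # this rollout went astray
        guess += random.choice([-2, -1, 1, 2])
    return guess

def verify(question, answer):
    """Cheap outcome verifier (here: exact arithmetic check)."""
    a, b = question
    return answer == a + b

def best_of_n(question, n):
    """Test-time scaling: draw n rollouts, return the first verified one,
    falling back to the last sample if none passes."""
    last = None
    for _ in range(n):
        last = generate_answer(question)
        if verify(question, last):
            return last
    return last

def accuracy(n, trials=2000):
    hits = 0
    for _ in range(trials):
        q = (random.randint(0, 9), random.randint(0, 9))
        hits += verify(q, best_of_n(q, n))
    return hits / trials

acc1, acc8 = accuracy(1), accuracy(8)
print(round(acc1, 3), round(acc8, 3))
```

With a per-rollout success rate p, best-of-N failure decays as (1 - p)^N, which is why extra thinking tokens buy measurable accuracy whenever a reliable verifier exists.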
Where Pith is reading between the lines
- If the scaling holds, these models could extend to long-horizon planning problems in robotics or scientific discovery where intermediate verification is possible.
- The method may reduce dependence on human-curated reasoning datasets, shifting data creation toward self-generated trajectories.
- A practical test would measure whether the extra test-time tokens yield consistent gains across domains that lack clear reward signals.
Load-bearing premise
Reinforcement learning applied to reasoning trajectories will reliably expand LLMs' reasoning capacity without introducing systematic biases or hallucinations that are harder to detect than in standard generation.
What would settle it
An experiment demonstrating that RL-trained reasoning models produce no net accuracy gain or exhibit higher rates of subtle, hard-to-detect errors on held-out complex reasoning tasks compared with standard LLMs would falsify the claimed expansion of capacity.
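Such a head-to-head comparison is a paired design: both models answer the same held-out items, so the standard decision rule is McNemar's test on the discordant pairs. The sketch below uses an exact binomial form with invented correctness vectors; the data are illustrative, not real benchmark results.

```python
from math import comb

def mcnemar_exact(results_a, results_b):
    """Exact (binomial) McNemar test on paired per-item correctness.
    Returns (b, c, p): items only A got right, items only B got right,
    and the two-sided p-value for 'equal error rates'."""
    b = sum(1 for x, y in zip(results_a, results_b) if x and not y)
    c = sum(1 for x, y in zip(results_a, results_b) if y and not x)
    n, k = b + c, min(b, c)
    if n == 0:
        return b, c, 1.0
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, p)

# Toy held-out correctness (1 = correct); hypothetical, for illustration only.
rl_model = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
baseline = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]

b, c, p = mcnemar_exact(rl_model, baseline)
print(b, c, round(p, 4))
```

A non-significant p-value on a large held-out set, or a significant advantage for the baseline, would be the falsifying outcome the review describes; detecting "subtle, hard-to-detect errors" additionally requires per-item error audits, not just aggregate accuracy.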
read the original abstract
Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to show a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects aimed at building large reasoning models, and conclude with open challenges and future research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys recent progress in LLM reasoning, tracing the shift from autoregressive generation to 'thought' processes and 'learning to reason' via reinforcement learning. It argues that RL enables automatic generation of high-quality reasoning trajectories through trial-and-error, substantially expanding capacity via more training data; combined with test-time scaling (more tokens at inference), this forms a path to Large Reasoning Models, with OpenAI o1 as a milestone. The review covers LLM background, automated data construction, learning-to-reason techniques, test-time scaling, open-source projects, and open challenges.
Significance. A well-executed survey in this rapidly evolving area could help organize the literature on reinforced reasoning and highlight the train-time/test-time scaling paradigm, providing a useful reference for researchers working on more capable reasoning systems.
major comments (2)
- [Abstract] The survey provides no explicit selection criteria for included papers or protocol for reconciling conflicting results across studies; this is load-bearing for a literature review's credibility and should be stated in the introduction or methods section.
- [learning-to-reason techniques] Section on learning-to-reason (automated data construction + RL): the repeated claim that RL produces 'high-quality' trajectories via trial-and-error assumes reward models avoid circular verification risks and systematic biases; the survey does not discuss independent verification mechanisms or empirical checks for hallucination amplification beyond the RL objective itself.
minor comments (2)
- [Introduction] The term 'Large Reasoning Model' is introduced as a new frontier without a precise definition or distinguishing criteria from existing LLMs; a short clarifying paragraph would help.
- [Throughout] Some citations to recent arXiv preprints (e.g., o1-related work) would benefit from explicit dates or version numbers to aid readers tracking fast-moving developments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation. We will revise the manuscript to address the two major comments by adding explicit methodological details and expanding the discussion of limitations in the learning-to-reason section.
read point-by-point responses
-
Referee: [Abstract] The survey provides no explicit selection criteria for included papers or protocol for reconciling conflicting results across studies; this is load-bearing for a literature review's credibility and should be stated in the introduction or methods section.
Authors: We agree that an explicit description of paper selection and conflict reconciliation strengthens survey credibility. In the revised manuscript we will add a dedicated paragraph in the Introduction (new subsection 1.3) that states: (i) search strategy (arXiv, ACL Anthology, NeurIPS/ICLR/ICML proceedings 2023-2025, keywords “reinforcement learning reasoning LLM”, “o1”, “test-time scaling”); (ii) inclusion criteria (peer-reviewed or high-impact preprints with empirical results on reasoning benchmarks); (iii) exclusion of purely theoretical or non-LLM work; and (iv) our approach to conflicting results (presenting both positive and negative findings with citations and noting open questions rather than claiming consensus). revision: yes
-
Referee: [learning-to-reason techniques] Section on learning-to-reason (automated data construction + RL): the repeated claim that RL produces 'high-quality' trajectories via trial-and-error assumes reward models avoid circular verification risks and systematic biases; the survey does not discuss independent verification mechanisms or empirical checks for hallucination amplification beyond the RL objective itself.
Authors: We accept the observation that the current text under-emphasizes verification risks. We will revise Section 4 (Learning-to-Reason) by inserting a new subsection 4.4 “Verification and Bias Considerations” that (a) acknowledges the circularity risk when reward models are trained on the same distribution as the policy, (b) cites existing empirical checks (e.g., process-supervised reward models, outcome verification via external solvers, and human preference studies), and (c) discusses reported cases of hallucination amplification under RL. We will also tone down the unqualified “high-quality” phrasing in the abstract and introduction to “trajectories that improve downstream benchmark performance under the learned reward model.” revision: yes
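The circularity risk named here has a simple empirical probe: audit how often the learned reward model's high-scoring trajectories are accepted by a verifier that is independent of the RL loop. The sketch below is entirely hypothetical: a toy reward model with a length bias is checked against an exact arithmetic verifier, and agreement near chance reveals that the reward is not tracking correctness.

```python
import random

random.seed(2)

def exact_verifier(problem, answer):
    """Independent ground-truth check, external to the learned reward."""
    a, b = problem
    return answer == a * b

def learned_reward(trace, answer):
    """Toy stand-in for a biased reward model: it scores longer,
    more 'thorough-looking' traces higher, regardless of the answer."""
    return min(1.0, len(trace) / 50.0)

def audit(n=1000):
    """Among trajectories the reward model rates highly (> 0.8),
    how many does the exact verifier actually accept?"""
    high, correct = 0, 0
    for _ in range(n):
        a, b = random.randint(2, 9), random.randint(2, 9)
        wrong = random.random() < 0.5
        answer = a * b + (1 if wrong else 0)
        trace = "step " * random.randint(1, 12)   # fake reasoning text
        if learned_reward(trace, answer) > 0.8:
            high += 1
            correct += exact_verifier((a, b), answer)
    return high, correct

high, correct = audit()
print(high, round(correct / max(high, 1), 3))
```

An acceptance rate near 50% among top-rated trajectories is exactly the hallucination-amplification failure mode the referee flags: RL against this reward would optimize trace length, not correctness.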
Circularity Check
Survey reports external literature without internal derivation or self-referential reduction
full rationale
The paper is a literature survey reviewing progress in LLM reasoning, RL-based training, automated data construction, and test-time scaling. It presents no original equations, fitted parameters, or quantitative derivations of its own. All technical claims are attributed to cited external works (including OpenAI o1), with no load-bearing step that reduces by construction to the survey's own inputs or self-citations. The central narrative simply organizes existing results rather than deriving new ones from within the paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement learning on reasoning trajectories produces higher-quality training data than human annotation alone.
invented entities (1)
-
Large Reasoning Model
no independent evidence
Forward citations
Cited by 18 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
MirageBackdoor is the first backdoor attack that preserves clean chain-of-thought reasoning in LLMs while steering the final answer to a specific incorrect target under a trigger.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
IE as Cache: Information Extraction Enhanced Agentic Reasoning
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
-
Metacognitive Behavioral Tuning of Large Language Models for Multi-Hop Question Answering
Metacognitive Behavioral Tuning injects a five-phase structure into LLM reasoning traces to improve accuracy-efficiency on multi-hop QA while reducing trace length and degeneration.
-
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
-
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment
Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.
-
Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer
LLMs and VLMs exhibit grounding-dependent causal transfer in the OpenLock paradigm, requiring initial context-specific learning unlike humans who transfer structures from the first attempt.
-
Reasoning Fails Where Step Flow Breaks
Step-Saliency identifies shallow lock-in and deep decay in reasoning model attention flows, and StepFlow intervention repairs them to improve accuracy across LRMs.
-
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
-
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
-
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
-
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...
-
Efficient Test-Time Scaling via Temporal Reasoning Aggregation
TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
-
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Reference graph
Works this paper leans on
-
[1]
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Direct preference optimization with an offset
Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571, 2024
-
[5]
Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019
Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019
work page 2019
-
[6]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Openai’s new o3 model freaks out computer science majors
AXIOS. Openai’s new o3 model freaks out computer science majors. 2025
work page 2025
-
[8]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new benchmark for physical reasoning. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[11]
Graph of thoughts: Solving elaborate problems with large language models
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024
work page 2024
-
[12]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020
work page 2020
-
[13]
Autonomous chemical research with large language models
Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023
work page 2023
-
[14]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
work page 2020
-
[15]
A survey of monte carlo tree search methods
Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012
work page 2012
-
[16]
Alphamath almost zero: process supervision without process
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process. arXiv preprint arXiv:2405.03553, 2024
-
[17]
Step-level value preference optimization for mathematical reasoning
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024
-
[18]
Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022
Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022
work page 2022
-
[19]
Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022
work page 2022
-
[20]
Large language model-driven meta-structure discovery in heterogeneous information network
Lin Chen, Fengli Xu, Nian Li, Zhenyu Han, Meng Wang, Yong Li, and Pan Hui. Large language model-driven meta-structure discovery in heterogeneous information network. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing, pages 307–318, 2024
work page 2024
-
[21]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[22]
Large language models meet harry potter: A dataset for aligning dialogue agents with characters
Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, and Jia Li. Large language models meet harry potter: A dataset for aligning dialogue agents with characters. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8506–8520, 2023
work page 2023
-
[23]
Theoremqa: A theorem-driven question answering dataset, 2023
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset, 2023
work page 2023
-
[24]
Training verifiers to solve math word problems, 2021
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021
work page 2021
-
[25]
Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. Meta-in-context learning in large language models. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[26]
Testing gpt-4-o1-preview on math and science problems: A follow-up study
Ernest Davis. Testing gpt-4-o1-preview on math and science problems: A follow-up study. arXiv preprint arXiv:2410.22340, 2024
-
[27]
System 2 thinking in openai’s o1-preview model: Near-perfect performance on a mathematics exam
Joost CF de Winter, Dimitra Dodou, and Yke Bauke Eisma. System 2 thinking in openai’s o1-preview model: Near-perfect performance on a mathematics exam. Computers, 13(11):278, 2024
work page 2024
-
[28]
Mind2web: Towards a generalist agent for the web
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[29]
Reasonbert: Pre-trained to reason with distant supervision
Xiang Deng, Yu Su, Alyssa Lees, You Wu, Cong Yu, and Huan Sun. Reasonbert: Pre-trained to reason with distant supervision. arXiv preprint arXiv:2109.04912, 2021
-
[30]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[31]
Data augmentation using llms: Data perspectives, learning paradigms and challenges
Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, and Shafiq Joty. Data augmentation using llms: Data perspectives, learning paradigms and challenges. arXiv preprint arXiv:2403.02990, 2024
-
[32]
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
A Survey on In-context Learning
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [34]
-
[35]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
The path to superintelligence: A critical analysis of openai’s five levels of ai progression
Tom Duenas and Diana Ruiz. The path to superintelligence: A critical analysis of openai’s five levels of ai progression. ResearchGate, 2024
work page 2024
-
[37]
Detecting hallucinations in large language models using semantic entropy
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024
work page 2024
-
[38]
Promptbreeder: Self-referential self-improvement via prompt evolution
Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023
-
[39]
Memory sharing for large language model based agents
Hang Gao and Yongfeng Zhang. Memory sharing for large language model based agents. arXiv preprint arXiv:2404.09982, 2024
-
[40]
Self-evolving gpt: A lifelong autonomous experiential learner
Jinglong Gao, Xiao Ding, Yiming Cui, Jianbai Zhao, Hepeng Wang, Ting Liu, and Bing Qin. Self-evolving gpt: A lifelong autonomous experiential learner. arXiv preprint arXiv:2407.08937, 2024
-
[41]
Rlef: Grounding code llms in execution feedback with reinforcement learning
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Syn- naeve. Rlef: Grounding code llms in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024
-
[42]
Llms accelerate annotation for medical information extraction
Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xiaohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, et al. Llms accelerate annotation for medical information extraction. In Machine Learning for Health (ML4H), pages 82–100. PMLR, 2023
work page 2023
-
[43]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Richelieu: Self-evolving llm-based agents for ai diplomacy
Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, and Yizhou Wang. Richelieu: Self-evolving llm-based agents for ai diplomacy. arXiv preprint arXiv:2407.06813, 2024
-
[45]
Reinforced Self-Training (ReST) for Language Modeling
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wojciech Kr...
work page 2024
-
[47]
James Hawthorne. Inductive Logic. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2024 edition, 2024
work page 2024
-
[48]
A cross-domain performance report of open ai chatgpt o1 model
Kadhim Hayawi and Sakib Shahriar. A cross-domain performance report of open ai chatgpt o1 model. 2024
work page 2024
-
[49]
Measuring Coding Challenge Competence With APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[50]
Measuring mathematical problem solving with the math dataset, 2021
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021
work page 2021
-
[51]
Learning to solve arithmetic word problems with verb categorization
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October 2014
work page 2014
-
[52]
Association for Computational Linguistics
-
[53]
Can gpt-o1 kill all bugs? an evaluation of gpt-family llms on quixbugs
Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, and Quanjun Zhang. Can gpt-o1 kill all bugs? an evaluation of gpt-family llms on quixbugs. arXiv e-prints, pages arXiv–2409, 2024
work page 2024
-
[54]
Automated Design of Agentic Systems
Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv preprint arXiv:2408.08435, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Agents’ room: Narrative generation through multi-step collaboration
Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Eliz- abeth Clark, and Mirella Lapata. Agents’ room: Narrative generation through multi-step collaboration. arXiv preprint arXiv:2410.02603, 2024
-
[56]
Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self- explore to avoid the pit: Improving the reasoning capabilities of language models with fine- grained rewards. arXiv preprint arXiv:2404.10346, 2024
-
[57]
blog: reinforcement fine-tuning
interconnects.ai. blog: reinforcement fine-tuning. (Accessed: 2025-12-6)
work page 2025
-
[58]
Lisa: Language models of isabelle proofs
Albert Q. Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. Lisa: Language models of isabelle proofs. 6th Conference on Artificial Intelligence and Theorem Proving, 2021
work page 2021
-
[59]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[61] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679, 2024.
[62] Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, and Dan Zhang. MEGAnno+: A human-LLM collaborative annotation system. arXiv preprint arXiv:2402.18050, 2024.
[63] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
[64] Kazushi Kondo, Saku Sugawara, and Akiko Aizawa. Probing physical reasoning with counter-commonsense context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 603–612, 2023.
[65] Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024.
[66] Teyun Kwon, Norman Di Palo, and Edward Johns. Language models as zero-shot trajectory generators. IEEE Robotics and Automation Letters, 2024.
[67] Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R Fabbri, Caiming Xiong, Shafiq Joty, and Chien-Sheng Wu. LLMs as factual reasoners: Insights from existing benchmarks and beyond. arXiv preprint arXiv:2305.14540, 2023.
[68] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
[69] Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nayaaba, Gyeonggeon Lee, Liang Zhang, Arne Bewersdorff, Luyang Fang, et al. A systematic assessment of OpenAI o1-preview for higher order thinking in education. arXiv preprint arXiv:2410.21287, 2024.
[70] Hao Duong Le, Xin Xia, and Zhang Chen. Multi-agent causal discovery using large language models. arXiv preprint arXiv:2407.15073, 2024.
[71] Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. LLM2LLM: Boosting LLMs with novel iterative data enhancement. arXiv preprint arXiv:2403.15042, 2024.
[72] M Lewis. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[73] Chengpeng Li, Guanting Dong, Mingfeng Xue, Ru Peng, Xiang Wang, and Dayiheng Liu. DotaMath: Decomposition of thought with code assistance and self-correction for mathematical reasoning. arXiv preprint arXiv:2407.04078, 2024.
[74] Leo Li, Ye Luo, and Tingyou Pan. OpenAI-o1 AB testing: Does the o1 model really do good reasoning in math problem solving? arXiv preprint arXiv:2411.06198, 2024.
[75] Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F Chen, Zhengyuan Liu, and Diyi Yang. CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation. arXiv preprint arXiv:2310.15638, 2023.
[76] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
[77] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023.
[78] Alisa Liu, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. WANLI: Worker and AI collaboration for natural language inference dataset creation, 2022.
[79] Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. FIMO: A challenge formal dataset for automated theorem proving, 2023.
[80] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.