Towards Reasoning in Large Language Models: A Survey
Pith reviewed 2026-05-18 13:18 UTC · model grok-4.3
The pith
Large language models exhibit reasoning abilities that prompting techniques can enhance and benchmarks can assess, though the full extent remains unclear.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning is fundamental to intelligence and large language models appear to possess it once they reach sufficient size, yet the precise scope of this capacity is not fully understood. This survey brings together methods for enhancing and eliciting reasoning, evaluation approaches and benchmarks, results from prior work, and recommendations for next steps in the field.
What carries the argument
A structured review that organizes techniques for eliciting reasoning in LLMs through prompting and training alongside assessment via targeted benchmarks.
If this is right
- Techniques such as chain-of-thought prompting can improve reasoning performance in LLMs on various tasks.
- Evaluation benchmarks provide standardized ways to measure logical, mathematical, and commonsense reasoning.
- Studies suggest that larger models tend to exhibit stronger reasoning but still face limitations.
- Future research should focus on more advanced evaluation and new methods to boost capabilities.
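To make the first point above concrete, here is a minimal sketch of how a chain-of-thought cue can be attached to a prompt. The function name `build_prompt`, the `Q:`/`A:` layout, and the example question are illustrative assumptions rather than anything taken from the survey. The trigger phrase is the widely used zero-shot chain-of-thought cue. No model is called, only the prompt string is built.

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Build a zero-shot prompt, optionally ending with a chain-of-thought cue.

    Hypothetical helper for illustration; the Q:/A: layout is an assumption.
    """
    prompt = f"Q: {question}\nA:"
    if chain_of_thought:
        # The cue asks the model to write intermediate steps before its answer.
        prompt += " Let's think step by step."
    return prompt


if __name__ == "__main__":
    question = "A shelf holds 3 boxes with 4 books each. How many books are there?"
    print(build_prompt(question))
    print(build_prompt(question, chain_of_thought=True))
```

The two printed prompts differ only in the trailing cue, which makes it easy to compare a model's answers with and without elicited intermediate steps on the same benchmark item.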
Where Pith is reading between the lines
- If the reviewed techniques prove effective across models, they could be integrated into standard AI development practices for more reliable outputs.
- Insights from this overview may inform how reasoning in LLMs relates to questions about human-like intelligence.
- New experiments could validate or extend the survey's synthesis with recently released models.
Load-bearing premise
The synthesis depends on the selected studies being a fair and complete representation of all relevant research without selection bias or overlooked contradictions.
What would settle it
A follow-up review that includes a wider range of papers and reaches substantially different conclusions about the state of LLM reasoning would indicate the current overview is incomplete or skewed.
Original abstract
Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys reasoning capabilities in large language models, covering techniques for improving and eliciting reasoning, evaluation methods and benchmarks, key findings and implications from prior work, and suggestions for future research directions, with the aim of providing a comprehensive and up-to-date review as of late 2022.
Significance. If the coverage is representative, the survey would offer a useful organizing resource for the NLP community by synthesizing techniques, benchmarks, and open questions in LLM reasoning at a time when the literature was expanding rapidly.
major comments (1)
- [Abstract] Abstract: the central claim of delivering a 'comprehensive overview' and 'detailed and up-to-date review' is load-bearing for the paper's contribution, yet no explicit literature-search protocol, database list, inclusion/exclusion criteria, or coverage statistics are provided, leaving open the possibility of selection bias in a fast-moving subfield.
minor comments (2)
- [Introduction] The manuscript would benefit from a short dedicated subsection (e.g., in the introduction) that states the search strategy and year range of included papers so readers can assess completeness.
- Some benchmark descriptions could be clarified with a summary table listing task type, dataset size, and whether the evaluation is zero-shot or few-shot.
Simulated Author's Rebuttal
We thank the referee for this constructive comment on the abstract. We agree that greater transparency regarding our literature review process will strengthen the manuscript and address potential concerns about coverage in this rapidly evolving area. We will revise accordingly.
- Referee: [Abstract] Abstract: the central claim of delivering a 'comprehensive overview' and 'detailed and up-to-date review' is load-bearing for the paper's contribution, yet no explicit literature-search protocol, database list, inclusion/exclusion criteria, or coverage statistics are provided, leaving open the possibility of selection bias in a fast-moving subfield.
Authors: We agree that an explicit description of the literature search process would improve transparency. The survey was compiled by reviewing papers available as of December 2022, drawing from arXiv preprints, ACL/EMNLP/NAACL proceedings, NeurIPS/ICLR workshops, and highly cited works on prompting and reasoning techniques. To address the concern, we will add a new subsection (e.g., 'Literature Search Methodology') in the Introduction that outlines the primary search keywords (e.g., 'chain-of-thought', 'reasoning in LLMs', 'emergent abilities'), sources queried (Google Scholar, arXiv, ACL Anthology), approximate scope (papers from 2020–2022 with a focus on post-2021 works), and inclusion criteria (works that directly address reasoning capabilities, evaluation, or improvement methods in LLMs). We will also note the approximate number of papers synthesized. This addition will clarify the coverage without altering the survey's scope or claims.
Revision: yes
Circularity Check
No circularity: survey aggregates external literature without derivations or self-referential reductions
This is a survey paper whose central contribution is synthesis of prior work on LLM reasoning techniques, benchmarks, and findings. No equations, predictions, fitted parameters, or first-principles derivations appear in the provided abstract or structure. All content rests on citations to external studies rather than internal reductions. The absence of any derivation chain means no steps can be shown to reduce to inputs by construction, satisfying the default expectation of no significant circularity for non-derivational papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs may exhibit reasoning abilities when they are sufficiently large
Forward citations
Cited by 18 Pith papers
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
- STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...
- Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
- TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models
TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.
- CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
- A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
- Reasoning with Language Model is Planning with World Model
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
- Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
- Semantic-Aware Logical Reasoning via a Semiotic Framework
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
- Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
- Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.
- A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
- Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
- A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
- OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research