arxiv: 2212.10403 · v2 · pith:U5JNDRNEnew · submitted 2022-12-20 · 💻 cs.CL · cs.AI

Towards Reasoning in Large Language Models: A Survey

Jie Huang, Kevin Chen-Chuan Chang

Pith reviewed 2026-05-18 13:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language models, reasoning abilities, prompt engineering, evaluation benchmarks, natural language processing, survey, artificial intelligence

The pith

Large language models exhibit reasoning abilities that prompting techniques can enhance and benchmarks can assess, though the full extent remains unclear.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to map out what is known about reasoning in large language models by reviewing techniques to improve it, ways to evaluate it, and what past studies have shown. A sympathetic reader would care because better insight into AI reasoning could help create systems that handle complex problems more dependably. The authors pull together observations from research and suggest paths forward to resolve open questions about how capable these models really are.

Core claim

Reasoning is fundamental to intelligence and large language models appear to possess it once they reach sufficient size, yet the precise scope of this capacity is not fully understood. This survey brings together methods for enhancing and eliciting reasoning, evaluation approaches and benchmarks, results from prior work, and recommendations for next steps in the field.

What carries the argument

A structured review that organizes techniques for eliciting reasoning in LLMs through prompting and training alongside assessment via targeted benchmarks.

If this is right

Techniques such as chain-of-thought prompting can improve reasoning performance in LLMs on various tasks.
Evaluation benchmarks provide standardized ways to measure logical, mathematical, and commonsense reasoning.
Studies suggest that larger models tend to exhibit stronger reasoning but still face limitations.
Future research should focus on more advanced evaluation and new methods to boost capabilities.
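
As a rough illustration of the first point, the difference between a standard few-shot prompt and a chain-of-thought prompt can be sketched in a few lines of Python. This is a minimal sketch, not the survey's own code: the `Exemplar` class, the `build_prompt` helper, and the sample question are all hypothetical names and data invented for this example, and no model is called.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example placed in the prompt (hypothetical structure)."""
    question: str
    rationale: str  # intermediate reasoning steps, i.e. the "chain of thought"
    answer: str


def build_prompt(exemplars, question, with_rationale=True):
    """Assemble a few-shot prompt.

    With with_rationale=True each exemplar shows its reasoning before the
    answer (chain-of-thought style); otherwise only the answer is shown.
    """
    parts = []
    for ex in exemplars:
        if with_rationale:
            parts.append(
                f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
            )
        else:
            parts.append(f"Q: {ex.question}\nA: The answer is {ex.answer}.")
    # The final question is left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplar written for this sketch, not taken from any benchmark.
EXEMPLARS = [
    Exemplar(
        question="A shelf holds 3 boxes with 4 books each. How many books are there?",
        rationale="Each box has 4 books and there are 3 boxes, so 3 * 4 = 12.",
        answer="12",
    )
]

if __name__ == "__main__":
    new_question = "A crate holds 5 bags with 6 apples each. How many apples are there?"
    print(build_prompt(EXEMPLARS, new_question))
```

Calling `build_prompt(EXEMPLARS, new_question, with_rationale=False)` produces the standard few-shot variant, which makes it easy to compare the two prompt styles on the same question.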

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the reviewed techniques prove effective across models, they could be integrated into standard AI development practices for more reliable outputs.
Insights from this overview may inform how reasoning in LLMs relates to questions about human-like intelligence.
New experiments could validate or extend the survey's synthesis with recently released models.

Load-bearing premise

The synthesis depends on the selected studies being a fair and complete representation of all relevant research without selection bias or overlooked contradictions.

What would settle it

A follow-up review that includes a wider range of papers and reaches substantially different conclusions about the state of LLM reasoning would indicate the current overview is incomplete or skewed.

Original abstract

Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This survey organizes early work on LLM reasoning into a clear structure, but its coverage claims are hard to verify without a selection protocol.

This is a standard survey paper that organizes existing research on reasoning in large language models. It covers prompting techniques to improve reasoning, benchmarks for testing it, key findings from studies, and ideas for future directions.

The main strength is the clear structure it brings to a bunch of scattered papers. Sections on chain-of-thought and similar methods, along with evaluation approaches, give readers a way to see the connections between different pieces of work. For people coming into this area, that organization is practical and saves time.

The weaker part is the missing detail on how the literature was collected. The claim of a comprehensive overview would be stronger if there were an explicit description of the search process or the criteria for including papers. Without that, it's possible some relevant work got left out, especially in a field moving as fast as this one did around late 2022.

The paper sticks to summarizing and grouping prior results without introducing new experiments or formal frameworks. That fits what a survey is supposed to do, and the citations look like they hit the main early papers on the topic.

This kind of paper is useful for students or researchers who need an overview before diving into specific papers. Experts already working on LLM reasoning might not find much new here, but it can serve as a reference point.

I think it deserves peer review. Referees can check for gaps in coverage and help refine the future directions section. A revised version could be a solid resource for the community.

Referee Report

1 major / 2 minor

Summary. The manuscript surveys reasoning capabilities in large language models, covering techniques for improving and eliciting reasoning, evaluation methods and benchmarks, key findings and implications from prior work, and suggestions for future research directions, with the aim of providing a comprehensive and up-to-date review as of late 2022.

Significance. If the coverage is representative, the survey would offer a useful organizing resource for the NLP community by synthesizing techniques, benchmarks, and open questions in LLM reasoning at a time when the literature was expanding rapidly.

major comments (1)

[Abstract] Abstract: the central claim of delivering a 'comprehensive overview' and 'detailed and up-to-date review' is load-bearing for the paper's contribution, yet no explicit literature-search protocol, database list, inclusion/exclusion criteria, or coverage statistics are provided, leaving open the possibility of selection bias in a fast-moving subfield.

minor comments (2)

[Introduction] The manuscript would benefit from a short dedicated subsection (e.g., in the introduction) that states the search strategy and year range of included papers so readers can assess completeness.
Some benchmark descriptions could be clarified with a summary table listing task type, dataset size, and whether the evaluation is zero-shot or few-shot.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this constructive comment on the abstract. We agree that greater transparency regarding our literature review process will strengthen the manuscript and address potential concerns about coverage in this rapidly evolving area. We will revise accordingly.

Referee: [Abstract] Abstract: the central claim of delivering a 'comprehensive overview' and 'detailed and up-to-date review' is load-bearing for the paper's contribution, yet no explicit literature-search protocol, database list, inclusion/exclusion criteria, or coverage statistics are provided, leaving open the possibility of selection bias in a fast-moving subfield.

Authors: We agree that an explicit description of the literature search process would improve transparency. The survey was compiled by reviewing papers available as of December 2022, drawing from arXiv preprints, ACL/EMNLP/NAACL proceedings, NeurIPS/ICLR workshops, and highly cited works on prompting and reasoning techniques. To address the concern, we will add a new subsection (e.g., 'Literature Search Methodology') in the Introduction that outlines the primary search keywords (e.g., 'chain-of-thought', 'reasoning in LLMs', 'emergent abilities'), sources queried (Google Scholar, arXiv, ACL Anthology), approximate scope (papers from 2020–2022 with a focus on post-2021 works), and inclusion criteria (works that directly address reasoning capabilities, evaluation, or improvement methods in LLMs). We will also note the approximate number of papers synthesized. This addition will clarify the coverage without altering the survey's scope or claims.

revision: yes
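
To make the kind of protocol described in this response concrete, a screening step could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the keywords and the 2020–2022 range are taken from the response above, while the `Paper` class, the `matches` and `screen` helpers, and the sample records are hypothetical and do not reflect how the authors actually compiled the survey.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Minimal bibliographic record (hypothetical structure for this sketch)."""
    title: str
    year: int
    abstract: str


# Keywords and year range as named in the simulated rebuttal.
KEYWORDS = ("chain-of-thought", "reasoning in llms", "emergent abilities")
YEAR_RANGE = (2020, 2022)


def matches(paper, keywords=KEYWORDS, year_range=YEAR_RANGE):
    """Return True if the paper is in the year range and mentions a keyword."""
    first_year, last_year = year_range
    if not first_year <= paper.year <= last_year:
        return False
    text = f"{paper.title} {paper.abstract}".lower()
    return any(keyword in text for keyword in keywords)


def screen(papers):
    """Split candidates into included papers and a count of excluded ones."""
    included = [p for p in papers if matches(p)]
    return included, len(papers) - len(included)


if __name__ == "__main__":
    # Hypothetical candidate records, invented for illustration only.
    candidates = [
        Paper("A Chain-of-Thought Prompting Study", 2022, "We study prompting."),
        Paper("An Older Parsing Paper", 2015, "Mentions chain-of-thought in passing."),
        Paper("Vision Transformers at Scale", 2021, "Image classification results."),
    ]
    included, excluded = screen(candidates)
    print(f"included: {[p.title for p in included]}, excluded: {excluded}")
```

Plain substring matching is a deliberately simple choice here; it would miss spelling variants such as "chain of thought" without hyphens, so a real protocol would likely need normalization or manual review on top of it. Reporting the excluded count alongside the included list is what would supply the coverage statistics the referee asks for.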

Circularity Check

0 steps flagged

No circularity: survey aggregates external literature without derivations or self-referential reductions

This is a survey paper whose central contribution is synthesis of prior work on LLM reasoning techniques, benchmarks, and findings. No equations, predictions, fitted parameters, or first-principles derivations appear in the provided abstract or structure. All content rests on citations to external studies rather than internal reductions. The absence of any derivation chain means no steps can be shown to reduce to inputs by construction, satisfying the default expectation of no significant circularity for non-derivational papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey the central claim rests on selection and synthesis of prior literature rather than new data or derivations; no free parameters, invented entities, or ad-hoc axioms are introduced by the paper itself.

axioms (1)

domain assumption: LLMs may exhibit reasoning abilities when they are sufficiently large
Stated as an observation in the abstract that motivates the survey.

pith-pipeline@v0.9.0 · 5654 in / 1117 out tokens · 41019 ms · 2026-05-18T13:18:01.821751+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL 2025-11 unverdicted novelty 7.0
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
cs.CL 2026-05 unverdicted novelty 6.0
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
cs.CL 2026-05 unverdicted novelty 6.0
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...

Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI 2026-05 unverdicted novelty 6.0
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models
cs.CL 2026-03 unverdicted novelty 6.0
TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.

CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
cs.AI 2026-01 unverdicted novelty 6.0
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.

A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

Reasoning with Language Model is Planning with World Model
cs.CL 2023-05 unverdicted novelty 6.0
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.

Multimodal Chain-of-Thought Reasoning in Language Models
cs.CL 2023-02 accept novelty 6.0
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.

Semantic-Aware Logical Reasoning via a Semiotic Framework
cs.AI 2025-09 conditional novelty 5.0
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.

Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
cs.AI 2025-10 unverdicted novelty 4.0
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
cs.CV 2025-03 unverdicted novelty 4.0
R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.

A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
cs.SE 2025-04