pith. machine review for the scientific record.

arxiv: 2410.05229 · v2 · submitted 2024-10-07 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link · Lean Theorem

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:38 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: large language models · mathematical reasoning · GSM8K · GSM-Symbolic · logical reasoning · benchmarks · model limitations · symbolic templates

The pith

Large language models cannot perform genuine mathematical reasoning and instead replicate patterns from training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GSM-Symbolic, a benchmark built from symbolic templates that generate many varied versions of the same grade-school math problems while preserving the core logic. Testing multiple state-of-the-art models on this benchmark reveals large performance swings when only the numbers in a question change and sharp drops as extra clauses are added. Adding one clause that appears relevant but does not affect the solution chain causes accuracy losses of up to 65 percent. The authors conclude that current LLMs lack true logical reasoning and rely on reproducing steps seen during training. This finding casts doubt on the reliability of standard benchmarks such as GSM8K for measuring actual reasoning progress.

Core claim

GSM-Symbolic, constructed via symbolic templates, produces controllable families of questions that keep the required reasoning chain fixed while varying numbers and clause count. Across open and closed models, accuracy varies substantially on different numerical instantiations of the same template and falls markedly with longer questions; the addition of a single non-contributory yet plausible clause triggers drops reaching 65 percent. The authors interpret these results as evidence that LLMs do not execute genuine logical inference but instead retrieve and replay reasoning fragments memorized from training data.

What carries the argument

GSM-Symbolic benchmark generated from symbolic templates that fix the reasoning structure while allowing systematic changes to numerical values and the insertion of additional clauses.
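
To make the template mechanism concrete, the sketch below (ours, not the authors' released code) instantiates one illustrative GSM8K-style template. The surface text, variable ranges, and answer formula are assumptions chosen for the example; the key property mirrors the paper's construction, in that every instance shares one fixed reasoning chain while the numbers vary.

    import random

    # One illustrative template in the spirit of GSM-Symbolic: the surface
    # text and the solution formula are fixed; only the numbers vary.
    TEMPLATE = (
        "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
        "She gives {z} apples to a friend. How many apples does she have left?"
    )

    def answer(x: int, y: int, z: int) -> int:
        # The reasoning chain (add, then subtract) is identical for every instance.
        return x + y - z

    def instantiate(seed: int) -> tuple[str, int]:
        rng = random.Random(seed)
        x, y = rng.randint(2, 50), rng.randint(2, 50)
        z = rng.randint(1, x + y - 1)  # keep the answer positive
        return TEMPLATE.format(name="Sophie", x=x, y=y, z=z), answer(x, y, z)

    # A small family of variants that share a single reasoning chain.
    for question, gold in (instantiate(s) for s in range(5)):
        print(question, "->", gold)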

Load-bearing premise

The added clauses are truly irrelevant to the solution process, so observed performance drops demonstrate an absence of genuine reasoning rather than sensitivity to prompt length or wording.

What would settle it

A model that maintains near-constant high accuracy across hundreds of numerical variants of the same template and shows no drop when irrelevant clauses are appended would falsify the claim that LLMs lack genuine reasoning.
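
A minimal harness for that falsification test might look like the following sketch; model_answer is a hypothetical callable standing in for any model under test, and the toy oracle merely illustrates the target behavior of near-zero spread and zero clause-induced drop.

    from statistics import mean, stdev

    def evaluate(model_answer, variants, clause=None):
        # Accuracy over (question, gold) pairs, optionally appending a
        # clause that does not affect the solution.
        correct = sum(
            model_answer(q if clause is None else f"{q} {clause}") == gold
            for q, gold in variants
        )
        return correct / len(variants)

    def robustness_report(model_answer, templates, irrelevant_clause):
        # A genuinely reasoning model should show low spread across numerical
        # variants and no drop when the irrelevant clause is appended.
        base = [evaluate(model_answer, v) for v in templates]
        perturbed = [evaluate(model_answer, v, irrelevant_clause) for v in templates]
        return {
            "base_mean": mean(base),
            "base_std": stdev(base) if len(base) > 1 else 0.0,
            "clause_drop": mean(base) - mean(perturbed),
        }

    def oracle(prompt: str) -> int:
        # Toy "model" that actually computes the sum, ignoring extra clauses.
        a, b = prompt.split("=")[0].split("+")
        return int(a) + int(b)

    templates = [[("2 + 3 = ?", 5), ("40 + 7 = ?", 47)],
                 [("6 + 9 = ?", 15), ("12 + 30 = ?", 42)]]
    print(robustness_report(oracle, templates, "It was sunny that day."))
    # -> base_mean 1.0, base_std 0.0, clause_drop 0.0 for this oracle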

read the original abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces GSM-Symbolic, a benchmark derived from symbolic templates of GSM8K questions, to enable controllable generation of diverse mathematical word problems. Through large-scale experiments on multiple open and closed LLMs, it reports high variance in model performance across different numerical instantiations of the same template, consistent declines as the number of clauses increases, and sharp drops (up to 65%) when a single seemingly irrelevant clause is added, even though the clause does not alter the required reasoning chain. The authors conclude that current LLMs do not perform genuine logical reasoning but instead replicate patterns from training data.

Significance. If the core empirical patterns hold, the work supplies a valuable diagnostic tool for distinguishing robust reasoning from surface-level pattern matching in LLMs. The template-based approach yields more reliable metrics than fixed GSM8K sets and documents fragility that is consistent across model families, which has direct implications for claims about reasoning progress.

major comments (3)
  1. [Section 5] Section 5 (added-clause experiments): the claim that the inserted clauses 'don't contribute to the reasoning chain' is stated without an explicit verification procedure or ablation that isolates clause position, length, or lexical overlap with the original template; without such controls the observed drops cannot be unambiguously attributed to absence of genuine reasoning rather than attention dilution or CoT parsing failures on longer inputs.
  2. [Section 4.2] Section 4.2 (numerical-variation results): the reported performance variance across number changes is presented as evidence against robust reasoning, yet the paper does not quantify or control for changes in token distribution or numeric magnitude that could affect model outputs independently of logical structure; this weakens the link between the observed drops and the central hypothesis.
  3. [Table 2] Table 2 / clause-count scaling: the monotonic decline with added clauses is load-bearing for the fragility claim, but the paper provides no per-model breakdown of whether the drop occurs at the first added clause or accumulates gradually, leaving open the possibility that the effect is dominated by input-length sensitivity rather than reasoning failure.
minor comments (3)
  1. [Section 3] The description of how templates were validated to ensure the added clauses remain irrelevant could be expanded with one or two concrete examples in the main text rather than only in the appendix.
  2. [Figure 3] Figure 3 caption should explicitly state the exact number of models and total questions underlying each bar to improve reproducibility.
  3. [Section 4] A brief note on the total compute or number of API calls used for the closed models would help readers assess the scale of the evaluation.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and detailed comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and describe the revisions we will make in the next version of the paper; minimal code sketches of the proposed controls follow the point-by-point responses.

read point-by-point responses
  1. Referee: [Section 5] Section 5 (added-clause experiments): the claim that the inserted clauses 'don't contribute to the reasoning chain' is stated without an explicit verification procedure or ablation that isolates clause position, length, or lexical overlap with the original template; without such controls the observed drops cannot be unambiguously attributed to absence of genuine reasoning rather than attention dilution or CoT parsing failures on longer inputs.

    Authors: We agree that additional controls would make the attribution more robust. In the revised manuscript we will add an explicit verification procedure consisting of: (1) manual inspection and annotation of a random sample of 100 added clauses confirming they introduce no new arithmetic operations or entities required for the solution, (2) controlled ablations that vary clause position (prefix, infix, suffix) while keeping length fixed, and (3) measurement of lexical overlap (token Jaccard) between the added clause and the original template. These results will be reported in an expanded Section 5 and an accompanying appendix table. We believe the core finding—that performance drops sharply even for clearly irrelevant clauses—will remain, but the new controls will rule out the alternative explanations raised. revision: yes

  2. Referee: [Section 4.2] Section 4.2 (numerical-variation results): the reported performance variance across number changes is presented as evidence against robust reasoning, yet the paper does not quantify or control for changes in token distribution or numeric magnitude that could affect model outputs independently of logical structure; this weakens the link between the observed drops and the central hypothesis.

    Authors: We acknowledge the value of quantifying these factors. In the revision we will add: (i) statistics on the distribution of token lengths and numeric magnitudes across the GSM-Symbolic instantiations, and (ii) a controlled subset experiment in which we restrict number magnitudes to a narrow band (e.g., 10–100) while still varying the specific values. We will also report variance stratified by magnitude bin. At the same time, we note that the same template produces large accuracy swings even when the numeric values remain within the same order of magnitude and token length, which is difficult to explain by magnitude or token effects alone. The new analyses will make this distinction explicit. revision: partial

  3. Referee: [Table 2] Table 2 / clause-count scaling: the monotonic decline with added clauses is load-bearing for the fragility claim, but the paper provides no per-model breakdown of whether the drop occurs at the first added clause or accumulates gradually, leaving open the possibility that the effect is dominated by input-length sensitivity rather than reasoning failure.

    Authors: We will expand Table 2 (and add a new figure) to include per-model curves showing accuracy as a function of clause count. These curves will be accompanied by a short analysis of the incremental drop at each step (first clause vs. subsequent clauses) and a comparison against a length-matched control set that adds neutral filler tokens rather than clauses. This will allow readers to assess whether the decline is driven primarily by the first clause (consistent with reasoning fragility) or by cumulative length (consistent with attention dilution). revision: yes
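
To make the three proposed controls concrete, here is a minimal sketch, written for illustration rather than drawn from the paper: token_jaccard for the lexical-overlap measurement in response 1, accuracy_by_magnitude_bin for the stratified analysis in response 2, and length_matched_filler for the length-matched control in response 3. The function names, bin width, and filler token are all illustrative assumptions.

    from collections import defaultdict
    from statistics import fmean

    def token_jaccard(a: str, b: str) -> float:
        # Lexical overlap between an added clause and its template
        # (response 1): Jaccard similarity of lowercased token sets.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def accuracy_by_magnitude_bin(results, bin_width=25):
        # Stratified accuracy (response 2): results holds
        # (largest_operand, correct) pairs; returns accuracy per magnitude
        # bin so number-size effects can be read off separately.
        bins = defaultdict(list)
        for magnitude, correct in results:
            bins[magnitude // bin_width].append(float(correct))
        return {b: fmean(v) for b, v in sorted(bins.items())}

    def length_matched_filler(question: str, clause: str) -> str:
        # Length-matched control (response 3): append contentless filler
        # with the same token count as the candidate clause, separating raw
        # input-length effects from the clause's semantic content.
        return question + " " + " ".join(["indeed"] * len(clause.split()))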

Circularity Check

0 steps flagged

No circularity; empirical benchmark measurements with no derived quantities or self-referential definitions

full rationale

The paper presents an empirical study introducing GSM-Symbolic via symbolic templates to generate question variants, then directly measures LLM accuracy, variance, and degradation under clause additions. No equations, fitted parameters, or first-principles derivations exist whose outputs reduce to the inputs by construction. The hypothesis that models replicate training steps rather than reason is an interpretive claim resting on observed performance drops, not a logical reduction or self-citation chain. All load-bearing elements are external measurements of model outputs on generated test cases, grounding the work in external evidence rather than in self-referential derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the symbolic templates preserve the original reasoning requirements while isolating the effects of numerical change and clause addition.

axioms (1)
  • domain assumption: Symbolic templates generate questions whose required reasoning chain is identical to that of the original GSM8K problems.
    Invoked when constructing GSM-Symbolic to ensure that performance changes reflect reasoning fragility rather than altered problem difficulty.

pith-pipeline@v0.9.0 · 5624 in / 1156 out tokens · 26414 ms · 2026-05-15T00:38:35.597661+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause doesn't contribute to the reasoning chain needed for the final answer"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  2. SEVerA: Verified Synthesis of Self-Evolving Agents

    cs.LG 2026-03 unverdicted novelty 8.0

    SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

  3. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

  4. Tracing Uncertainty in Language Model "Reasoning"

    cs.LG 2026-05 unverdicted novelty 7.0

    Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.

  5. Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

    cs.AI 2026-04 unverdicted novelty 7.0

    A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

  6. RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)

    cs.CY 2026-03 unverdicted novelty 7.0

    RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.

  7. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

  8. Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

  9. When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

    cs.CL 2026-05 conditional novelty 6.0

    AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

  10. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  11. QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

  12. Agentic Frameworks for Reasoning Tasks: An Empirical Study

    cs.AI 2026-04 unverdicted novelty 6.0

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  13. Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

    cs.MA 2026-04 unverdicted novelty 6.0

    HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

  14. The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    LLMs prioritize surface heuristics such as distance cues over implicit constraints in reasoning tasks, with the new HOB benchmark showing no model exceeds 75% strict accuracy and hints recovering performance.

  15. Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

    cs.AI 2026-05 unverdicted novelty 5.0

    Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

  16. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  17. One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

    cs.CL 2026-04 unverdicted novelty 5.0

    ReQueR trains a single RL-based query refiner with an adaptive curriculum to decompose raw queries into structured logic, delivering 1.7-7.2% absolute gains on reasoning tasks across diverse LLMs and generalizing to u...

  18. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  19. A pragmatic approach to regulating AI agents

    cs.CY 2026-04 unverdicted novelty 5.0

    AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

  20. Too long; didn't solve

    cs.AI 2026-04 unverdicted novelty 5.0

    Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.

  21. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  22. EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

    cs.AI 2026-04 unverdicted novelty 4.0

    EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 22 Pith papers · 14 internal anchors
