pith. machine review for the scientific record.

arxiv: 2410.05229 · v2 · submitted 2024-10-07 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link · Lean Theorem

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:38 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: large language models · mathematical reasoning · GSM8K · GSM-Symbolic · logical reasoning · benchmarks · model limitations · symbolic templates

The pith

Large language models cannot perform genuine mathematical reasoning and instead replicate patterns from training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GSM-Symbolic, a benchmark built from symbolic templates that generate many varied versions of the same grade-school math problems while preserving the core logic. Testing multiple state-of-the-art models on this benchmark reveals large performance swings when only the numbers in a question change and sharp drops as extra clauses are added. Adding one clause that appears relevant but does not affect the solution chain causes accuracy losses of up to 65 percent. The authors conclude that current LLMs lack true logical reasoning and rely on reproducing steps seen during training. This finding casts doubt on the reliability of standard benchmarks such as GSM8K for measuring actual reasoning progress.

Core claim

GSM-Symbolic, constructed via symbolic templates, produces controllable families of questions that keep the required reasoning chain fixed while varying numbers and clause count. Across open and closed models, accuracy varies substantially on different numerical instantiations of the same template and falls markedly with longer questions; the addition of a single non-contributory yet plausible clause triggers drops reaching 65 percent. The authors interpret these results as evidence that LLMs do not execute genuine logical inference but instead retrieve and replay reasoning fragments memorized from training data.

What carries the argument

GSM-Symbolic benchmark generated from symbolic templates that fix the reasoning structure while allowing systematic changes to numerical values and the insertion of additional clauses.
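
To make the template mechanism concrete, the sketch below (ours, not the authors' released code) instantiates one illustrative GSM8K-style template. The surface text, variable ranges, and answer formula are assumptions chosen for the example; the key property mirrors the paper's construction, in that every instance shares one fixed reasoning chain while the numbers vary.

    import random

    # One illustrative template in the spirit of GSM-Symbolic: the surface
    # text and the solution formula are fixed; only the numbers vary.
    TEMPLATE = (
        "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
        "She gives {z} apples to a friend. How many apples does she have left?"
    )

    def answer(x: int, y: int, z: int) -> int:
        # The reasoning chain (add, then subtract) is identical for every instance.
        return x + y - z

    def instantiate(seed: int) -> tuple[str, int]:
        rng = random.Random(seed)
        x, y = rng.randint(2, 50), rng.randint(2, 50)
        z = rng.randint(1, x + y - 1)  # keep the answer positive
        return TEMPLATE.format(name="Sophie", x=x, y=y, z=z), answer(x, y, z)

    # A small family of variants that share a single reasoning chain.
    for question, gold in (instantiate(s) for s in range(5)):
        print(question, "->", gold)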

Load-bearing premise

The added clauses are truly irrelevant to the solution process, so observed performance drops demonstrate an absence of genuine reasoning rather than sensitivity to prompt length or wording.

What would settle it

A model that maintains near-constant high accuracy across hundreds of numerical variants of the same template and shows no drop when irrelevant clauses are appended would falsify the claim that LLMs lack genuine reasoning.
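
A minimal harness for that falsification test might look like the following sketch; model_answer is a hypothetical callable standing in for any model under test, and the toy oracle merely illustrates the target behavior of near-zero spread and zero clause-induced drop.

    from statistics import mean, stdev

    def evaluate(model_answer, variants, clause=None):
        # Accuracy over (question, gold) pairs, optionally appending a
        # clause that does not affect the solution.
        correct = sum(
            model_answer(q if clause is None else f"{q} {clause}") == gold
            for q, gold in variants
        )
        return correct / len(variants)

    def robustness_report(model_answer, templates, irrelevant_clause):
        # A genuinely reasoning model should show low spread across numerical
        # variants and no drop when the irrelevant clause is appended.
        base = [evaluate(model_answer, v) for v in templates]
        perturbed = [evaluate(model_answer, v, irrelevant_clause) for v in templates]
        return {
            "base_mean": mean(base),
            "base_std": stdev(base) if len(base) > 1 else 0.0,
            "clause_drop": mean(base) - mean(perturbed),
        }

    def oracle(prompt: str) -> int:
        # Toy "model" that actually computes the sum, ignoring extra clauses.
        a, b = prompt.split("=")[0].split("+")
        return int(a) + int(b)

    templates = [[("2 + 3 = ?", 5), ("40 + 7 = ?", 47)],
                 [("6 + 9 = ?", 15), ("12 + 30 = ?", 42)]]
    print(robustness_report(oracle, templates, "It was sunny that day."))
    # -> base_mean 1.0, base_std 0.0, clause_drop 0.0 for this oracle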

read the original abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces GSM-Symbolic, a benchmark derived from symbolic templates of GSM8K questions, to enable controllable generation of diverse mathematical word problems. Through large-scale experiments on multiple open and closed LLMs, it reports high variance in model performance across different numerical instantiations of the same template, consistent declines as the number of clauses increases, and sharp drops (up to 65%) when a single seemingly irrelevant clause is added, even though the clause does not alter the required reasoning chain. The authors conclude that current LLMs do not perform genuine logical reasoning but instead replicate patterns from training data.

Significance. If the core empirical patterns hold, the work supplies a valuable diagnostic tool for distinguishing robust reasoning from surface-level pattern matching in LLMs. The template-based approach yields more reliable metrics than fixed GSM8K sets and documents fragility that is consistent across model families, which has direct implications for claims about reasoning progress.

major comments (3)
  1. [Section 5] Section 5 (added-clause experiments): the claim that the inserted clauses 'don't contribute to the reasoning chain' is stated without an explicit verification procedure or ablation that isolates clause position, length, or lexical overlap with the original template; without such controls the observed drops cannot be unambiguously attributed to absence of genuine reasoning rather than attention dilution or CoT parsing failures on longer inputs.
  2. [Section 4.2] Section 4.2 (numerical-variation results): the reported performance variance across number changes is presented as evidence against robust reasoning, yet the paper does not quantify or control for changes in token distribution or numeric magnitude that could affect model outputs independently of logical structure; this weakens the link between the observed drops and the central hypothesis.
  3. [Table 2] Table 2 / clause-count scaling: the monotonic decline with added clauses is load-bearing for the fragility claim, but the paper provides no per-model breakdown of whether the drop occurs at the first added clause or accumulates gradually, leaving open the possibility that the effect is dominated by input-length sensitivity rather than reasoning failure.
minor comments (3)
  1. [Section 3] The description of how templates were validated to ensure the added clauses remain irrelevant could be expanded with one or two concrete examples in the main text rather than only in the appendix.
  2. [Figure 3] Figure 3 caption should explicitly state the exact number of models and total questions underlying each bar to improve reproducibility.
  3. [Section 4] A brief note on the total compute or number of API calls used for the closed models would help readers assess the scale of the evaluation.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and detailed comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and describe the revisions we will make in the next version of the paper; minimal code sketches of the proposed controls follow the point-by-point responses.

read point-by-point responses
  1. Referee: [Section 5] Section 5 (added-clause experiments): the claim that the inserted clauses 'don't contribute to the reasoning chain' is stated without an explicit verification procedure or ablation that isolates clause position, length, or lexical overlap with the original template; without such controls the observed drops cannot be unambiguously attributed to absence of genuine reasoning rather than attention dilution or CoT parsing failures on longer inputs.

    Authors: We agree that additional controls would make the attribution more robust. In the revised manuscript we will add an explicit verification procedure consisting of: (1) manual inspection and annotation of a random sample of 100 added clauses confirming they introduce no new arithmetic operations or entities required for the solution, (2) controlled ablations that vary clause position (prefix, infix, suffix) while keeping length fixed, and (3) measurement of lexical overlap (token Jaccard) between the added clause and the original template. These results will be reported in an expanded Section 5 and an accompanying appendix table. We believe the core finding—that performance drops sharply even for clearly irrelevant clauses—will remain, but the new controls will rule out the alternative explanations raised. revision: yes

  2. Referee: [Section 4.2] Section 4.2 (numerical-variation results): the reported performance variance across number changes is presented as evidence against robust reasoning, yet the paper does not quantify or control for changes in token distribution or numeric magnitude that could affect model outputs independently of logical structure; this weakens the link between the observed drops and the central hypothesis.

    Authors: We acknowledge the value of quantifying these factors. In the revision we will add: (i) statistics on the distribution of token lengths and numeric magnitudes across the GSM-Symbolic instantiations, and (ii) a controlled subset experiment in which we restrict number magnitudes to a narrow band (e.g., 10–100) while still varying the specific values. We will also report variance stratified by magnitude bin. At the same time, we note that the same template produces large accuracy swings even when the numeric values remain within the same order of magnitude and token length, which is difficult to explain by magnitude or token effects alone. The new analyses will make this distinction explicit. revision: partial

  3. Referee: [Table 2] Table 2 / clause-count scaling: the monotonic decline with added clauses is load-bearing for the fragility claim, but the paper provides no per-model breakdown of whether the drop occurs at the first added clause or accumulates gradually, leaving open the possibility that the effect is dominated by input-length sensitivity rather than reasoning failure.

    Authors: We will expand Table 2 (and add a new figure) to include per-model curves showing accuracy as a function of clause count. These curves will be accompanied by a short analysis of the incremental drop at each step (first clause vs. subsequent clauses) and a comparison against a length-matched control set that adds neutral filler tokens rather than clauses. This will allow readers to assess whether the decline is driven primarily by the first clause (consistent with reasoning fragility) or by cumulative length (consistent with attention dilution). revision: yes
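
To make the three proposed controls concrete, here is a minimal sketch, written for illustration rather than drawn from the paper: token_jaccard for the lexical-overlap measurement in response 1, accuracy_by_magnitude_bin for the stratified analysis in response 2, and length_matched_filler for the length-matched control in response 3. The function names, bin width, and filler token are all illustrative assumptions.

    from collections import defaultdict
    from statistics import fmean

    def token_jaccard(a: str, b: str) -> float:
        # Lexical overlap between an added clause and its template
        # (response 1): Jaccard similarity of lowercased token sets.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def accuracy_by_magnitude_bin(results, bin_width=25):
        # Stratified accuracy (response 2): results holds
        # (largest_operand, correct) pairs; returns accuracy per magnitude
        # bin so number-size effects can be read off separately.
        bins = defaultdict(list)
        for magnitude, correct in results:
            bins[magnitude // bin_width].append(float(correct))
        return {b: fmean(v) for b, v in sorted(bins.items())}

    def length_matched_filler(question: str, clause: str) -> str:
        # Length-matched control (response 3): append contentless filler
        # with the same token count as the candidate clause, separating raw
        # input-length effects from the clause's semantic content.
        return question + " " + " ".join(["indeed"] * len(clause.split()))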

Circularity Check

0 steps flagged

No circularity; empirical benchmark measurements with no derived quantities or self-referential definitions

full rationale

The paper presents an empirical study introducing GSM-Symbolic via symbolic templates to generate question variants, then directly measures LLM accuracy, variance, and degradation under clause additions. No equations, fitted parameters, or first-principles derivations exist whose outputs reduce to the inputs by construction. The hypothesis that models replicate training steps rather than reason is an interpretive claim resting on observed performance drops, not a logical reduction or self-citation chain. All load-bearing elements are external measurements of model outputs on generated test cases, grounding the work in external evidence rather than in self-referential derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the symbolic templates preserve the original reasoning requirements while isolating the effects of numerical change and clause addition.

axioms (1)
  • domain assumption: Symbolic templates generate questions whose required reasoning chain is identical to that of the original GSM8K problems.
    Invoked when constructing GSM-Symbolic to ensure that performance changes reflect reasoning fragility rather than altered problem difficulty.

pith-pipeline@v0.9.0 · 5624 in / 1156 out tokens · 26414 ms · 2026-05-15T00:38:35.597661+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause doesn't contribute to the reasoning chain needed for the final answer"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  2. SEVerA: Verified Synthesis of Self-Evolving Agents

    cs.LG 2026-03 unverdicted novelty 8.0

    SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

  3. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

  4. Tracing Uncertainty in Language Model "Reasoning"

    cs.LG 2026-05 unverdicted novelty 7.0

    Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.

  5. Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

    cs.AI 2026-04 unverdicted novelty 7.0

    A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

  6. RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)

    cs.CY 2026-03 unverdicted novelty 7.0

    RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.

  7. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

  8. Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

  9. When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

    cs.CL 2026-05 conditional novelty 6.0

    AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

  10. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  11. QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

  12. Agentic Frameworks for Reasoning Tasks: An Empirical Study

    cs.AI 2026-04 unverdicted novelty 6.0

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  13. Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

    cs.MA 2026-04 unverdicted novelty 6.0

    HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

  14. The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    LLMs prioritize surface heuristics such as distance cues over implicit constraints in reasoning tasks, with the new HOB benchmark showing no model exceeds 75% strict accuracy and hints recovering performance.

  15. Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

    cs.AI 2026-05 unverdicted novelty 5.0

    Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

  16. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  17. One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

    cs.CL 2026-04 unverdicted novelty 5.0

    ReQueR trains a single RL-based query refiner with an adaptive curriculum to decompose raw queries into structured logic, delivering 1.7-7.2% absolute gains on reasoning tasks across diverse LLMs and generalizing to u...

  18. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  19. A pragmatic approach to regulating AI agents

    cs.CY 2026-04 unverdicted novelty 5.0

    AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

  20. Too long; didn't solve

    cs.AI 2026-04 unverdicted novelty 5.0

    Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.

  21. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  22. EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

    cs.AI 2026-04 unverdicted novelty 4.0

    EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 22 Pith papers · 14 internal anchors
