Recognition: 1 Lean theorem link
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Pith reviewed 2026-05-15 00:38 UTC · model grok-4.3
The pith
Large language models cannot perform genuine mathematical reasoning and instead replicate patterns from training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSM-Symbolic, constructed via symbolic templates, produces controllable families of questions that keep the required reasoning chain fixed while varying numbers and clause count. Across open and closed models, accuracy varies substantially on different numerical instantiations of the same template and falls markedly with longer questions; the addition of a single non-contributory yet plausible clause triggers drops reaching 65 percent. The authors interpret these results as evidence that LLMs do not execute genuine logical inference but instead retrieve and replay reasoning fragments memorized from training data.
What carries the argument
GSM-Symbolic benchmark generated from symbolic templates that fix the reasoning structure while allowing systematic changes to numerical values and the insertion of additional clauses.
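The template mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's actual generator: the template text, the names, and the constraints are hypothetical stand-ins. The point is that the reasoning chain (add, then subtract) is fixed by construction while the surface values vary.

```python
import random

# Hypothetical GSM8K-style template; placeholders are filled per instance.
TEMPLATE = (
    "{name} picked {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gave away {z} apples. How many apples are left?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Sample values satisfying the template's constraints (positive counts,
    non-negative answer) and return (question, gold_answer). The required
    reasoning chain is identical across every instantiation."""
    rng = random.Random(seed)  # seeded for reproducible variants
    name = rng.choice(["Ava", "Ben", "Mia"])
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    z = rng.randint(1, x + y)  # constraint: the answer stays non-negative
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    return question, x + y - z

question, answer = instantiate(seed=0)
assert answer >= 0
```

Sweeping the seed yields hundreds of numerical variants of one template, which is the controllable family the benchmark relies on.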
Load-bearing premise
The added clauses are truly irrelevant to the solution process, so observed performance drops demonstrate an absence of genuine reasoning rather than sensitivity to prompt length or wording.
What would settle it
A model that maintains near-constant high accuracy across hundreds of numerical variants of the same template and shows no drop when irrelevant clauses are appended would falsify the claim that LLMs lack genuine reasoning.
read the original abstract
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GSM-Symbolic, a benchmark derived from symbolic templates of GSM8K questions, to enable controllable generation of diverse mathematical word problems. Through large-scale experiments on multiple open and closed LLMs, it reports high variance in model performance across different numerical instantiations of the same template, consistent declines as the number of clauses increases, and sharp drops (up to 65%) when a single seemingly irrelevant clause is added, even though the clause does not alter the required reasoning chain. The authors conclude that current LLMs do not perform genuine logical reasoning but instead replicate patterns from training data.
Significance. If the core empirical patterns hold, the work supplies a valuable diagnostic tool for distinguishing robust reasoning from surface-level pattern matching in LLMs. The template-based approach yields more reliable metrics than fixed GSM8K sets and documents fragility that is consistent across model families, which has direct implications for claims about reasoning progress.
major comments (3)
- [Section 5] Section 5 (added-clause experiments): the claim that the inserted clauses 'don't contribute to the reasoning chain' is stated without an explicit verification procedure or ablation that isolates clause position, length, or lexical overlap with the original template; without such controls the observed drops cannot be unambiguously attributed to absence of genuine reasoning rather than attention dilution or CoT parsing failures on longer inputs.
- [Section 4.2] Section 4.2 (numerical-variation results): the reported performance variance across number changes is presented as evidence against robust reasoning, yet the paper does not quantify or control for changes in token distribution or numeric magnitude that could affect model outputs independently of logical structure; this weakens the link between the observed drops and the central hypothesis.
- [Table 2] Table 2 / clause-count scaling: the monotonic decline with added clauses is load-bearing for the fragility claim, but the paper provides no per-model breakdown of whether the drop occurs at the first added clause or accumulates gradually, leaving open the possibility that the effect is dominated by input-length sensitivity rather than reasoning failure.
minor comments (3)
- [Section 3] The description of how templates were validated to ensure the added clauses remain irrelevant could be expanded with one or two concrete examples in the main text rather than only in the appendix.
- [Figure 3] Figure 3 caption should explicitly state the exact number of models and total questions underlying each bar to improve reproducibility.
- [Section 4] A brief note on the total compute or number of API calls used for the closed models would help readers assess the scale of the evaluation.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and detailed comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and describe the revisions we will make in the next version of the paper.
read point-by-point responses
-
Referee: [Section 5] Section 5 (added-clause experiments): the claim that the inserted clauses 'don't contribute to the reasoning chain' is stated without an explicit verification procedure or ablation that isolates clause position, length, or lexical overlap with the original template; without such controls the observed drops cannot be unambiguously attributed to absence of genuine reasoning rather than attention dilution or CoT parsing failures on longer inputs.
Authors: We agree that additional controls would make the attribution more robust. In the revised manuscript we will add an explicit verification procedure consisting of: (1) manual inspection and annotation of a random sample of 100 added clauses confirming they introduce no new arithmetic operations or entities required for the solution, (2) controlled ablations that vary clause position (prefix, infix, suffix) while keeping length fixed, and (3) measurement of lexical overlap (token Jaccard) between the added clause and the original template. These results will be reported in an expanded Section 5 and an accompanying appendix table. We believe the core finding—that performance drops sharply even for clearly irrelevant clauses—will remain, but the new controls will rule out the alternative explanations raised. revision: yes
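The lexical-overlap measurement this response proposes can be sketched as a plain token-level Jaccard similarity. The whitespace tokenization and the example strings below are simplifying assumptions for illustration, not the authors' actual pipeline.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings.
    Tokenization is lowercased whitespace splitting (a simplification)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical added clause vs. original question text.
clause = "five of the kiwis were a bit smaller than average"
question = "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday"
overlap = token_jaccard(clause, question)
assert 0.0 <= overlap <= 1.0
```

Reporting this score per added clause would let readers check whether drops correlate with surface overlap rather than with the clause's (ir)relevance.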
-
Referee: [Section 4.2] Section 4.2 (numerical-variation results): the reported performance variance across number changes is presented as evidence against robust reasoning, yet the paper does not quantify or control for changes in token distribution or numeric magnitude that could affect model outputs independently of logical structure; this weakens the link between the observed drops and the central hypothesis.
Authors: We acknowledge the value of quantifying these factors. In the revision we will add: (i) statistics on the distribution of token lengths and numeric magnitudes across the GSM-Symbolic instantiations, and (ii) a controlled subset experiment in which we restrict number magnitudes to a narrow band (e.g., 10–100) while still varying the specific values. We will also report variance stratified by magnitude bin. At the same time, we note that the same template produces large accuracy swings even when the numeric values remain within the same order of magnitude and token length, which is difficult to explain by magnitude or token effects alone. The new analyses will make this distinction explicit. revision: partial
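The stratified analysis proposed here can be sketched as grouping instances by a magnitude bin and reporting accuracy spread per bin. The digit-count binning and the demo records are illustrative assumptions, not paper data.

```python
from statistics import mean, pstdev

def stratify(records):
    """records: iterable of (max_numeric_value_in_question, correct: bool).
    Returns {bin -> (mean accuracy, population std dev)}, binned by the
    digit count of the largest number (a stand-in for magnitude bands)."""
    bins = {}
    for value, correct in records:
        bin_key = len(str(int(value)))  # 1 digit, 2 digits, ...
        bins.setdefault(bin_key, []).append(1.0 if correct else 0.0)
    return {k: (mean(v), pstdev(v)) for k, v in sorted(bins.items())}

# Hypothetical per-instance results for one template.
demo = [(12, True), (15, True), (87, False), (340, True), (905, False)]
per_bin = stratify(demo)
```

If accuracy variance stays large inside a single magnitude bin, magnitude effects alone cannot explain the swings, which is the distinction the response aims to make explicit.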
-
Referee: [Table 2] Table 2 / clause-count scaling: the monotonic decline with added clauses is load-bearing for the fragility claim, but the paper provides no per-model breakdown of whether the drop occurs at the first added clause or accumulates gradually, leaving open the possibility that the effect is dominated by input-length sensitivity rather than reasoning failure.
Authors: We will expand Table 2 (and add a new figure) to include per-model curves showing accuracy as a function of clause count. These curves will be accompanied by a short analysis of the incremental drop at each step (first clause vs. subsequent clauses) and a comparison against a length-matched control set that adds neutral filler tokens rather than clauses. This will allow readers to assess whether the decline is driven primarily by the first clause (consistent with reasoning fragility) or by cumulative length (consistent with attention dilution). revision: yes
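The incremental-drop analysis described in this response reduces to a step-wise difference over a per-model accuracy curve. The curve values below are hypothetical, chosen only to illustrate a first-clause-dominated decline; they are not results from the paper.

```python
def incremental_drops(acc_by_clauses: dict[int, float]) -> list[float]:
    """Given accuracy keyed by number of added clauses, return the change
    in accuracy at each successive clause count (negative = drop)."""
    counts = sorted(acc_by_clauses)
    return [
        acc_by_clauses[b] - acc_by_clauses[a]
        for a, b in zip(counts, counts[1:])
    ]

# Hypothetical curve: a sharp drop at the first added clause, then near-flat,
# the pattern that would point to reasoning fragility over length sensitivity.
curve = {0: 0.90, 1: 0.55, 2: 0.52, 3: 0.50}
drops = incremental_drops(curve)  # roughly -0.35, then small residual drops
```

Comparing these step sizes against a length-matched filler control, as the rebuttal proposes, is what would separate the two explanations.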
Circularity Check
No circularity; empirical benchmark measurements with no derived quantities or self-referential definitions
full rationale
The paper presents an empirical study: it introduces GSM-Symbolic via symbolic templates to generate question variants, then directly measures LLM accuracy, variance, and degradation under clause additions. There are no equations, fitted parameters, or first-principles derivations whose outputs reduce to their inputs by construction. The hypothesis that models replicate training steps rather than reason is an interpretive claim resting on observed performance drops, not a logical reduction or self-citation chain. All load-bearing elements are external measurements of model outputs against generated test cases, so the work is grounded in external benchmarks rather than self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: symbolic templates generate questions whose required reasoning chain is identical to that of the original GSM8K problems.
Lean theorems connected to this paper
-
Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
"we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause doesn't contribute to the reasoning chain needed for the final answer"
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
-
SEVerA: Verified Synthesis of Self-Evolving Agents
SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.
-
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
-
Tracing Uncertainty in Language Model "Reasoning"
Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)
RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.
-
Robust Reasoning Benchmark
Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.
-
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
-
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
-
Agentic Frameworks for Reasoning Tasks: An Empirical Study
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
-
Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
-
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
LLMs prioritize surface heuristics such as distance cues over implicit constraints in reasoning tasks, with the new HOB benchmark showing no model exceeds 75% strict accuracy and hints recovering performance.
-
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
-
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
-
One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement
ReQueR trains a single RL-based query refiner with an adaptive curriculum to decompose raw queries into structured logic, delivering 1.7-7.2% absolute gains on reasoning tasks across diverse LLMs and generalizing to u...
-
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.
-
A pragmatic approach to regulating AI agents
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
-
Too long; didn't solve
Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
EMS: Multi-Agent Voting via Efficient Majority-then-Stopping
EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.
Reference graph
Works this paper leans on
- [8]
-
[9]
Hwang and Soumya Sanyal and Xiang Ren and Allyson Ettinger and Za
Nouha Dziri and Ximing Lu and Melanie Sclar and Xiang Lorraine Li and Liwei Jiang and Bill Yuchen Lin and Sean Welleck and Peter West and Chandra Bhagavatula and Ronan Le Bras and Jena D. Hwang and Soumya Sanyal and Xiang Ren and Allyson Ettinger and Za. Faith and Fate: Limits of Transformers on Compositionality , booktitle =. 2023 , url =
work page 2023
-
[10]
Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed H. Chi and Quoc V. Le and Denny Zhou , editor =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , booktitle =. 2022 , url =
work page 2022
-
[11]
Freda Shi and Xinyun Chen and Kanishka Misra and Nathan Scales and David Dohan and Ed H. Chi and Nathanael Sch. Large Language Models Can Be Easily Distracted by Irrelevant Context , booktitle =. 2023 , url =
work page 2023
-
[12]
Are NLP Models really able to Solve Simple Math Word Problems? , author=. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics , pages=
work page 2021
-
[13]
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms , author=. Proceedings of NAACL , pages=
-
[15]
Large Language Models are Zero-Shot Reasoners
Large Language Models are Zero-Shot Reasoners , author=. arXiv preprint arXiv:2205.11916 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Does learning require memorization? a short tale about a long tail , volume=
Feldman, Vitaly , year=. Does learning require memorization? a short tale about a long tail , volume=. doi:10.1145/3357713.3384290 , booktitle=
-
[18]
Faith and Fate: Limits of Transformers on Compositionality , author=. 2023 , eprint=
work page 2023
-
[19]
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change) , author=. arXiv preprint arXiv:2304.11477 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Solving math word problems with process- and outcome-based feedback
Improving Mathematical Reasoning with Language Models , author=. arXiv preprint arXiv:2211.14275 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=
Imitation Learning for Solving Math Word Problems , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=
work page 2022
-
[22]
Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
Program induction by rationale generation: Learning to solve and explain algebraic word problems , author=. arXiv preprint arXiv:1705.04146 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Are Emergent Abilities of Large Language Models a Mirage? , author=. 2023 , eprint=
work page 2023
-
[25]
Thinking Like Transformers , booktitle =
Gail Weiss and Yoav Goldberg and Eran Yahav , editor =. Thinking Like Transformers , booktitle =. 2021 , url =
work page 2021
-
[26]
Susskind and Samy Bengio and Preetum Nakkiran , title =
Hattie Zhou and Arwen Bradley and Etai Littwin and Noam Razin and Omid Saremi and Joshua M. Susskind and Samy Bengio and Preetum Nakkiran , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =
work page 2024
-
[27]
When can transformers reason with abstract symbols? , booktitle =
Enric Boix. When can transformers reason with abstract symbols? , booktitle =. 2024 , url =
work page 2024
-
[28]
Neural Networks and the Chomsky Hierarchy , booktitle =
Gr. Neural Networks and the Chomsky Hierarchy , booktitle =. 2023 , url =
work page 2023
-
[29]
The Twelfth International Conference on Learning Representations,
Zhiyuan Liu and Hong Liu and Denny Zhou and Tengyu Ma , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =
work page 2024
-
[32]
One-Layer Transformer Provably Learns One-Nearest Neighbor In Context , author=
-
[33]
Aithal and Pratyush Maini and Zachary C
Sumukh K. Aithal and Pratyush Maini and Zachary C. Lipton and J. Zico Kolter , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2406.09358 , eprinttype =. 2406.09358 , timestamp =
-
[38]
Annals of the New York Academy of Sciences , year=
Can large language models reason and plan? , author=. Annals of the New York Academy of Sciences , year=
-
[39]
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench , author=. 2024 , url=
work page 2024
-
[40]
A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models
Stolfo, Alessandro and Jin, Zhijing and Shridhar, Kumar and Schoelkopf, Bernhard and Sachan, Mrinmaya. A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023
work page 2023
-
[42]
The Thirteenth International Conference on Learning Representations , year=
Limits of Deep Learning: Sequence Modeling through the Lens of Complexity Theory , author=. The Thirteenth International Conference on Learning Representations , year=
-
[44]
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve , author=. 2023 , eprint=
work page 2023
-
[45]
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap , author=. 2024 , eprint=
work page 2024
-
[46]
Thomas Ball and Shuo Chen and Cormac Herley , journal=. Can We Count on. 2024 , url=
work page 2024
-
[47]
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence,
On the Paradox of Learning to Reason from Data , author =. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence,. 2023 , month =
work page 2023
-
[48]
What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study
Madaan, Aman and Hermann, Katherine and Yazdanbakhsh, Amir. What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023
work page 2023
-
[49]
The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
-
[50]
arXiv preprint arXiv:2406.04229 , year=
The CLRS-Text Algorithmic Reasoning Language Benchmark , author=. arXiv preprint arXiv:2406.04229 , year=
-
[51]
Cole, David , title =. The. 2024 , edition =
work page 2024
-
[52]
Defining intelligence: Bridging the gap between human and artificial perspectives , author=. Intelligence , year=
-
[53]
The algebraic mind: Integrating connectionism and cognitive science , author=. 2003 , publisher=
work page 2003
-
[54]
IRE Transactions on information theory , volume=
The logic theory machine--A complex information processing system , author=. IRE Transactions on information theory , volume=. 1956 , publisher=
work page 1956
-
[55]
Probabilistic reasoning in intelligent systems: networks of plausible inference , author=. 2014 , publisher=
work page 2014
-
[56]
Knowledge representation and reasoning , author=. 2004 , publisher=
work page 2004
-
[57]
Universal intelligence: A definition of machine intelligence , author=. Minds and machines , volume=. 2007 , publisher=
work page 2007
-
[59]
Ivanova, Desi R and Ilievski, Ilija and Konstantinov, Momchil , title =. ICLR Blogposts 2025 , year =
work page 2025
-
[60]
Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S \' e bastien Bubeck, Martin Cai, Caio C \' e sar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dix...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14219 2024
-
[61]
Gemini: A Family of Highly Capable Multimodal Models
Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean - Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805 2023
-
[62]
Enric Boix - Adser \` a , Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, and Joshua M. Susskind. When can transformers reason with abstract symbols? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=STUGfUz8ob
work page 2024
-
[63]
Knowledge representation and reasoning
Ronald Brachman and Hector Levesque. Knowledge representation and reasoning. Elsevier, 2004
work page 2004
-
[64]
On the Measure of Intelligence
Fran c ois Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[65]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[66]
David Cole. The Chinese Room Argument . In Edward N. Zalta and Uri Nodelman (eds.), The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, W inter 2024 edition, 2024
work page 2024
-
[67]
Gr \' e goire Del \' e tang, Anian Ruoss, Jordi Grau - Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A. Ortega. Neural networks and the chomsky hierarchy. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023...
work page 2023
-
[68]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al - Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aur \' e lien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozi \` e...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
-
[69]
Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi
Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality, 2023
work page 2023
-
[70]
Gilles E. Gignac and Eva T. Szodorai. Defining intelligence: Bridging the gap between human and artificial perspectives. Intelligence, 2024. URL https://api.semanticscholar.org/CorpusID:269015060
work page 2024
-
[71]
Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung - Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vive...
-
[72]
Evaluating llms' mathematical and coding competency through ontology-guided interventions
Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, and Soujanya Poria. Evaluating llms' mathematical and coding competency through ontology-guided interventions. arXiv preprint arXiv:2401.09395, 2024
-
[73]
Towards more rigorous evaluations of language models
Desi R Ivanova, Ilija Ilievski, and Momchil Konstantinov. Towards more rigorous evaluations of language models. In ICLR Blogposts 2025, 2025. URL https://iclr-blogposts.github.io/2025/blog/towards-more-rigorous-llm-evals/. https://iclr-blogposts.github.io/2025/blog/towards-more-rigorous-llm-evals/
work page 2025
-
[74]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L \' e lio Renard Lavaud, Marie - Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth \' e e Lacroix, and William El Sayed. Mistral 7b. CoRR, abs/...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
-
[75]
Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor, and Dan Roth. A peek into token bias: Large language models are not yet genuine reasoners. CoRR, abs/2406.11050, 2024. doi:10.48550/ARXIV.2406.11050. URL https://doi.org/10.48550/arXiv.2406.11050
-
[76]
Subbarao Kambhampati. Can large language models reason and plan? Annals of the New York Academy of Sciences, 1534: 0 15 -- 18, 2024. URL https://api.semanticscholar.org/CorpusID:268249961
work page 2024
-
[77]
Universal intelligence: A definition of machine intelligence
Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and machines, 17: 0 391--444, 2007
work page 2007
[78] Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Ba...
[79] Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason M. Klusowski, Jianqing Fan, and Mengdi Wang. One-layer transformer provably learns one-nearest neighbor in context. 2024b. URL https://api.semanticscholar.org/CorpusID:272307690
[80] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=3EWTEy9MTM
[81] Gary F. Marcus. The algebraic mind: Integrating connectionism and cognitive science. MIT Press, 2003
[82] R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve, 2023. URL https://arxiv.org/abs/2309.13638
[83] Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterso... 2024. doi:10.48550/arXiv.2403.08295
[84] Allen Newell and Herbert Simon. The logic theory machine--a complex information processing system. IRE Transactions on Information Theory, 2(3):61–79, 1956
[85] Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061, 2024
[86] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi:10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774
[87] OpenAI. Learning to reason with large language models. https://openai.com/index/learning-to-reason-with-llms/, 2024. Accessed: 2024-09-29
[88] Judea Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Elsevier, 2014
[89] Binghui Peng, Srini Narayanan, and Christos H. Papadimitriou. On limitations of the transformer architecture. CoRR, abs/2402.08164, 2024. doi:10.48550/ARXIV.2402.08164. URL https://doi.org/10.48550/arXiv.2402.08164
[90] Yasaman Razeghi, Adam Roberts, Colin Raffel, and Ariel Herbert-Voss. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.08904, 2022
[91] Morgane Rivière, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgi... 2024. doi:10.48550/arXiv.2408.00118
[92] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage?, 2023
[93] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 202...
[94] Saurabh Srivastava, Annarose M B, Anto P V, Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, and Sooraj Thomas. Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap, 2024. URL https://arxiv.org/abs/2402.19450
[95] Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schoelkopf, and Mrinmaya Sachan. A causal framework to quantify the robustness of mathematical reasoning with language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers...
[96] Karthik Valmeekam, Alberto Olmo Hernandez, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can't plan (A benchmark for llms on planning and reasoning about change). CoRR, abs/2206.10498, 2022. doi:10.48550/ARXIV.2206.10498. URL https://doi.org/10.48550/arXiv.2206.10498
[97]
Llms still can't plan; can lrms? a preliminary evaluation of openai's o1 on planbench
Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. Llms still can't plan; can lrms? a preliminary evaluation of openai's o1 on planbench. 2024. URL https://api.semanticscholar.org/CorpusID:272770270
work page 2024
[98] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
[99] Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking like transformers. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 11080–11090. PMLR, 2021. URL http://proceedings.mlr.press/v139/weiss21a.html