pith. sign in

arxiv: 2606.08728 · v1 · pith:3USIPORNnew · submitted 2026-06-07 · 💻 cs.AI · cs.CL· cs.CV· cs.LG

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

Pith reviewed 2026-06-27 18:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CVcs.LG
keywords mathematical reasoninglanguage modelsneuro-symbolic systemstheorem provingAI benchmarksfailure modesverified discovery
0
0 comments X

The pith

AI mathematical reasoning is organized into four axes: informal solving over text and diagrams, formal proving in assistants, mathematical discovery, and the inference and training techniques that link them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The survey establishes a unified account of how AI systems approach mathematical reasoning by tracing the field's development from early rule-based methods to current large language models, neuro-symbolic provers, and verified discovery pipelines. It divides the landscape into informal reasoning over text and diagrams, formal reasoning inside proof assistants, systems that propose new constructions or attack open problems, and the prompting, tool-use, and reinforcement techniques that connect generation to verification. The account also catalogs benchmarks across arithmetic, competition problems, geometry, and formal proving while examining issues like saturation, contamination, and the gap between pass-at-1 and verifier-assisted results. A sympathetic reader would care because mathematical reasoning remains one of the clearest tests of whether AI systems can move beyond pattern matching to reliable, verifiable inference.

Core claim

This survey organizes the AI mathematical reasoning landscape along four axes: informal reasoning over text and diagrams that includes math word problems, multimodal geometry, and vision-language models; formal reasoning inside proof assistants covering autoformalization, tactic prediction, and proof search; mathematical discovery where systems propose constructions or improve bounds on open problems; and inference and training techniques such as chain-of-thought prompting, tool use, process reward models, and reinforcement learning with verification that increasingly tie generation to checking.

What carries the argument

The four-axis taxonomy that partitions the field into informal reasoning, formal reasoning in proof assistants, mathematical discovery, and inference/training techniques while also cataloging benchmarks and failure modes.

If this is right

  • Progress can be tracked separately along each of the four axes while noting where techniques from one axis improve another.
  • Benchmark issues such as contamination and the difference between pass@1 and verifier-assisted scoring affect how results are interpreted across informal and formal settings.
  • Failure modes like brittleness under perturbation and reward hacking appear in both language-model and neuro-symbolic systems.
  • Future systems are expected to combine generation with verification in verified-discovery workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a checklist when designing new hybrid systems that move fluidly between informal exploration and formal checking.
  • Attention to energy cost and inference efficiency may become a fifth practical axis as model scale grows.
  • Making formalization tools usable by working mathematicians would test whether the survey's identified directions actually accelerate open-problem work.

Load-bearing premise

That the chosen papers, benchmarks, and listed failure modes give a representative picture of a fast-moving field without major gaps or biases in coverage.

What would settle it

A later review that identifies a large body of work falling outside the four axes or that documents dominant failure modes not listed in the survey would show the organization is incomplete.

Figures

Figures reproduced from arXiv: 2606.08728 by Hasan Mahmud, Md Kamrul Hasan, Mohsinul Kabir, Syed Rifat Raiyan.

Figure 2
Figure 2. Figure 2: Autoformalization and formal theorem proving illus [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Search, screening, and taxonomy coding pipeline for the survey corpus. The left side shows the record-selection flow [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A taxonomy of mathematical reasoning systems. The four main axes (informal text-only, multimodal, formal, discovery) [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Chronology of AI systems for mathematical reasoning, organized as paradigm swimlanes. Early symbolic and statistical [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Equation Tree representing (x + n1) ÷ n2 = n3. Seq2Seq methods: [41] introduced DNS, an RNN-based SEQ2SEQ with a similarity-based retrieval component. The encoder uses GRUs [119], the decoder uses LSTM [120], and a Significant Number Identification (SNI) model iden￾tifies relevant numbers in the problem text. It released the landmark MATH23K corpus of 23,161 Chinese MWPs. EQUGENER [121] employs a memory-ne… view at source ↗
Figure 8
Figure 8. Figure 8: A worked-example comparison of SEQ2TREE (top) and GRAPH2TREE (bottom) on the same problem; yellow￾highlighted tokens are extracted quantities. SEQ2TREE must infer from the LSTM’s hidden state that the two numbers share the unit “marbles”; GRAPH2TREE receives this fact explicitly through a quantity-relation graph encoding “5 same-unit 3”, so the GNN’s hidden state separates which quantities can combine arit… view at source ↗
Figure 9
Figure 9. Figure 9: The comprehension–generation–verification triad. Solid [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Evolution of prompting topologies for mathematical reasoning. The upper timeline sketches the shift from direct [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Four multi-agent topologies for mathematical reasoning. (a) Debate: fully connected rounds with a judge [ [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Evolution of RL algorithms for reasoning-model training. The main track (top) shows the progression from critic-based [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: The verified-discovery pipeline. Top row: four abstract stages, neural proposer (cyan), informal reasoning (yellow), formalization (pink), and verification (purple), each with its representative artifacts. Bottom row: a concrete instantiation by the GPT-5.2 Pro + ARISTOTLE team that produced a Lean-checked proof of Erdos problem #729 in early 2026, with one panel per ˝ stage showing the actual artifact it… view at source ↗
Figure 14
Figure 14. Figure 14: Benchmark saturation, 2021–2026: best-reported single-model performance on nine canonical mathematical-reasoning [PITH_FULL_IMAGE:figures/full_fig_p034_14.png] view at source ↗
read the original abstract

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a literature survey that organizes AI approaches to mathematical reasoning into four axes—informal reasoning over text and diagrams, formal reasoning in proof assistants, mathematical discovery, and inference/training techniques—while cataloging benchmarks across arithmetic, competition math, geometry, and formal proving, and critically examining failure modes such as benchmark contamination, reward hacking, and multimodal grounding failures. It concludes with future directions emphasizing verified-discovery workflows and efficiency improvements, supported by a companion GitHub repository.

Significance. If the synthesis is representative, the four-axis taxonomy and benchmark catalog could provide a useful organizing framework for a fast-moving area, with the explicit discussion of pass@k distinctions and energy costs of inference adding practical value. The companion GitHub repository (awesome-AI4Math) is a concrete strength that supports reproducibility and ongoing updates.

major comments (1)
  1. [Abstract and §1] Abstract and §1 (Introduction): the central claim that the survey constitutes a 'unified account' and 'critically assesses' failure modes rests on the assumption of representative literature selection, yet no explicit inclusion criteria, coverage metrics, or search methodology are stated; this directly affects the weakest assumption identified in the review and requires a dedicated subsection to make the representativeness claim verifiable.
minor comments (2)
  1. The abstract lists 'multimodal and multilingual reasoning' benchmarks but the corresponding catalog section should include a table or explicit enumeration to improve scannability.
  2. Future-directions paragraph would benefit from concrete, falsifiable milestones (e.g., specific open problems or benchmark thresholds) rather than high-level aspirations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. The single major comment identifies a genuine gap in methodological transparency that we can readily address without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): the central claim that the survey constitutes a 'unified account' and 'critically assesses' failure modes rests on the assumption of representative literature selection, yet no explicit inclusion criteria, coverage metrics, or search methodology are stated; this directly affects the weakest assumption identified in the review and requires a dedicated subsection to make the representativeness claim verifiable.

    Authors: We agree that the absence of an explicit literature-selection methodology weakens the verifiability of the 'unified account' and 'critically assesses' claims. In the revised manuscript we will insert a new subsection (placed immediately after the current §1.1) titled 'Literature Selection and Scope'. This subsection will detail: (i) search strategy (keywords, databases, and venues queried, including arXiv categories cs.AI/cs.LG, top conferences 2018–2024, and citation thresholds), (ii) inclusion/exclusion criteria (peer-reviewed or high-impact preprints focused on AI-driven mathematical reasoning, with explicit cut-offs for coverage of each axis), and (iii) quantitative coverage metrics (number of papers per axis and benchmark category). The GitHub repository will also be updated with the same selection log. These additions directly satisfy the request while preserving the existing narrative. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a literature survey paper with no derivations, equations, fitted parameters, predictions, or theoretical claims. The central contribution is an organizational taxonomy across four axes plus benchmark cataloging and failure-mode discussion; none of these reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The paper is self-contained as a review and draws on external literature without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work introduces no free parameters, axioms, or invented entities; it aggregates and organizes existing published results.

pith-pipeline@v0.9.1-grok · 5870 in / 920 out tokens · 19864 ms · 2026-06-27T18:38:20.763854+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

266 extracted references · 55 linked inside Pith

  1. [1]

    E. A. Feigenbaum, J. Feldman,et al.,Computers and thought. New York: McGraw-Hill, 1963

  2. [2]

    Natural language input for a computer problem solving system,

    D. G. Bobrow, “Natural language input for a computer problem solving system,” 1964

  3. [3]

    Neu- ropsychological performance, iq, personality, and grades in a longi- tudinal grade-school male sample,

    J. Peterson, R. Pihl, D. Higgins, J. S ´eguin, and R. Tremblay, “Neu- ropsychological performance, iq, personality, and grades in a longi- tudinal grade-school male sample,”Individual Differences Research, vol. 1, pp. 159–172, 2003

  4. [4]

    A broad look at the literature on math word problem-solving interventions for third graders,

    S. Kingsdorf and J. Krawec, “A broad look at the literature on math word problem-solving interventions for third graders,”Cogent Education, vol. 3, no. 1, p. 1135770, 2016

  5. [5]

    Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning,

    P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S.-C. Zhu, “Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning,”arXiv preprint arXiv:2105.04165, 2021

  6. [6]

    The Lean 4 theorem prover and program- ming language,

    L. de Moura and S. Ullrich, “The Lean 4 theorem prover and program- ming language,”CADE, 2021

  7. [7]

    Mathematical discoveries from program search with large language models,

    B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi,et al., “Mathematical discoveries from program search with large language models,”Nature, vol. 625, pp. 468–475, 2024

  8. [8]

    AlphaEvolve: A cod- ing agent for scientific and algorithmic discovery,

    A. Novikov, N. V ˜u, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii,et al., “AlphaEvolve: A cod- ing agent for scientific and algorithmic discovery,”arXiv preprint arXiv:2506.13131, 2025

  9. [9]

    Autoformalization with large language models,

    Y . Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy, “Autoformalization with large language models,”NeurIPS, vol. 35, pp. 32353–32368, 2022

  10. [10]

    Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,

    A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y . Wu, and G. Lample, “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,”ICLR, 2023

  11. [11]

    Erd\h{o}s’ minimum overlap problem,

    E. P. White, “Erd\h{o}s’ minimum overlap problem,”arXiv preprint arXiv:2201.05704, 2022

  12. [12]

    Are nlp models really able to solve simple math word problems?,

    A. Patel, S. Bhattamishra, and N. Goyal, “Are nlp models really able to solve simple math word problems?,”arXiv preprint arXiv:2103.07191, 2021

  13. [13]

    GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,

    I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,”arXiv preprint arXiv:2410.05229, 2024

  14. [14]

    LeanDojo: Theorem prov- ing with retrieval-augmented language models,

    K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. Prenger, and A. Anandkumar, “LeanDojo: Theorem prov- ing with retrieval-augmented language models,” inNeurIPS Datasets and Benchmarks Track, 2023

  15. [15]

    A review of methods for automatic understanding of natural language mathematical problems,

    A. Mukherjee and U. Garain, “A review of methods for automatic understanding of natural language mathematical problems,”Artificial Intelligence Review, vol. 29, no. 2, pp. 93–122, 2008

  16. [16]

    Solving arithmetic mathematical word problems: A review and recent advancements,

    S. Mandal and S. K. Naskar, “Solving arithmetic mathematical word problems: A review and recent advancements,” inInformation Tech- nology and Applied Mathematics(P. Chandra, D. Giri, F. Li, S. Kar, and D. K. Jana, eds.), (Singapore), pp. 95–114, Springer Singapore, 2019

  17. [17]

    The gap of semantic parsing: A survey on automatic math word problem solvers,

    D. Zhang, L. Wang, L. Zhang, B. T. Dai, and H. T. Shen, “The gap of semantic parsing: A survey on automatic math word problem solvers,” 2018

  18. [18]

    Why are nlp models fumbling at elementary math? a survey of deep learning based word problem solvers,

    S. S. Sundaram, S. Gurajada, M. Fisichella, S. S. Abraham,et al., “Why are nlp models fumbling at elementary math? a survey of deep learning based word problem solvers,”arXiv preprint arXiv:2205.15683, 2022

  19. [19]

    A survey of deep learning for mathematical reasoning,

    P. Lu, L. Qiu, W. Yu, S. Welleck, and K.-W. Chang, “A survey of deep learning for mathematical reasoning,”ACL, 2023

  20. [20]

    Large lan- guage models for mathematical reasoning: Progresses and challenges,

    J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin, “Large lan- guage models for mathematical reasoning: Progresses and challenges,” inEACL Student Research Workshop, 2024

  21. [21]

    Learning to reason with LLMs,

    OpenAI, “Learning to reason with LLMs,” tech. rep., OpenAI, 2024. https://openai.com/index/learning-to-reason-with-llms/

  22. [22]

    Mathematical language models: A survey,

    W. Liu, H. Hu, J. Zhou, Y . Ding, J. Li, J. Zeng, M. He, Q. Chen, B. Jiang, A. Zhou, and L. He, “Mathematical language models: A survey,”ACM Computing Surveys, vol. 58, no. 6, 2025

  23. [23]

    A survey on large language models for mathematical reasoning,

    P.-Y . Wang, T.-S. Liu, C. Wang, Z. Li, Y . Wang, S. Yan, C. Jia, X.-H. Liu, X. Chen, J. Xu, and Y . Yu, “A survey on large language models for mathematical reasoning,”ACM Computing Surveys, vol. 58, no. 8, 2026

  24. [24]

    A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,

    Y . Yan, J. Su, J. He, F. Fu, X. Zheng, Y . Lyu, K. Wang, S. Wang, Q. Wen, and X. Hu, “A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,” inFindings of the Association for Computational Linguistics: ACL 2025, pp. 11798–11827, 2025

  25. [25]

    Large language model based multi-agents: A survey of progress and challenges,

    T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,”arXiv preprint arXiv:2402.01680, 2024

  26. [26]

    Formal mathematical reasoning: A new frontier in AI,

    K. Yang, G. Poesia, J. He, W. Li, K. Lauter, S. Chaudhuri, and D. Song, “Formal mathematical reasoning: A new frontier in AI,”arXiv preprint arXiv:2412.16075, 2024

  27. [27]

    A survey on deep learning for theorem proving,

    Z. Li, J. Sun, L. Murphy, Q. Su, Z. Li, X. Zhang, K. Yang, and X. Si, “A survey on deep learning for theorem proving,”Conference on Language Modeling (COLM), 2024

  28. [28]

    AI will become mathematicians’ co-pilot

    T. Tao, “AI will become mathematicians’ co-pilot.” Scientific American Interview, 2024

  29. [29]

    AI is ready for primetime in math and theoretical physics

    T. Tao, “AI is ready for primetime in math and theoretical physics.” IPAM talk, reported by OpenAI Academy, 2026

  30. [30]

    Mathematical methods and human thought in the age of AI,

    T. Klowden and T. Tao, “Mathematical methods and human thought in the age of AI,”arXiv preprint, 2026. To appear in Blackwell Companion to the Philosophy of Mathematics

  31. [31]

    Guidelines for performing systematic literature reviews in software engineering,

    B. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering,” Tech. Rep. EBSE-2007-01, Keele University and Durham University, 2007

  32. [32]

    PAL: Program-aided language models,

    L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “PAL: Program-aided language models,” inICML, 2023

  33. [33]

    Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,

    W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,”TMLR, 2023

  34. [34]

    ToRA: A tool-integrated reasoning agent for mathematical problem solving,

    Z. Gou, Z. Shao, Y . Gong, Y . Shen, Y . Yang, M. Huang, N. Duan, and W. Chen, “ToRA: A tool-integrated reasoning agent for mathematical problem solving,” inICLR, 2024

  35. [35]

    GeoQA: A geometric question answering benchmark towards multi- modal numerical reasoning,

    J. Chen, J. Tang, J. Qin, X. Liang, L. Liu, E. P. Xing, and L. Lin, “GeoQA: A geometric question answering benchmark towards multi- modal numerical reasoning,”arXiv preprint arXiv:2105.14517, 2021

  36. [36]

    G-LLaV A: Solving geometric problem with multi-modal large language model,

    J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y . Wang, L. Hong, J. Han, H. Xu, Z. Li, and L. Kong, “G-LLaV A: Solving geometric problem with multi-modal large language model,”arXiv preprint arXiv:2312.11370, 2023

  37. [37]

    DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,

    H. Xin, D. Guo, Z. Shao, Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang, “DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,”arXiv preprint arXiv:2405.14333, 2024

  38. [38]

    Understanding and solving arithmetic word problems: A computer simulation,

    C. R. Fletcher, “Understanding and solving arithmetic word problems: A computer simulation,”Behavior Research Methods, Instruments, & Computers, vol. 17, no. 5, pp. 565–571, 1985

  39. [39]

    Learning to solve arithmetic word problems with verb categorization,

    M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman, “Learning to solve arithmetic word problems with verb categorization,” inEMNLP, 2014

  40. [40]

    Learning to automatically solve algebra word problems,

    N. Kushman, Y . Artzi, L. Zettlemoyer, and R. Barzilay, “Learning to automatically solve algebra word problems,” inProceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 271–281, 2014

  41. [41]

    Deep neural solver for math word problems,

    Y . Wang, X. Liu, and S. Shi, “Deep neural solver for math word problems,” inEMNLP, pp. 845–854, 2017

  42. [42]

    A goal-driven tree-structured neural model for math word problems,

    Z. Xie and S. Sun, “A goal-driven tree-structured neural model for math word problems,” inIJCAI, pp. 5299–5305, 2019

  43. [43]

    Graph-to-tree learning for solving math word problems,

    J. Zhang, L. Wang, R. K.-W. Lee, Y . Bin, Y . Wang, J. Shao, and E.- P. Lim, “Graph-to-tree learning for solving math word problems,” in ACL, 2020

  44. [44]

    Generate & rank: A multi-task framework for math word problems,

    J. Shen, Y . Yin, L. Li, L. Shang, X. Jiang, M. Zhang, and Q. Liu, “Generate & rank: A multi-task framework for math word problems,” arXiv preprint arXiv:2109.03034, 2021

  45. [45]

    MWP-BERT: A strong baseline for math word problems,

    Z. Liang, J. Zhang, J. Shao, and X. Zhang, “MWP-BERT: A strong baseline for math word problems,”arXiv preprint arXiv:2107.13435, 2021

  46. [46]

    Learning to reason deductively: Math word problem solving as complex relation extraction,

    Z. Jie, J. Li, and W. Lu, “Learning to reason deductively: Math word problem solving as complex relation extraction,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5944–5955, 2022

  47. [47]

    Solving quantitative reasoning problems with language models,

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo,et al., “Solving quantitative reasoning problems with language models,” NeurIPS, vol. 35, pp. 3843–3857, 2022

  48. [48]

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu,et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

  49. [49]

    Qwen2.5-math technical report: Toward mathematical expert model via self-improvement,

    A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin,et al., “Qwen2.5-math technical report: Toward mathematical expert model via self-improvement,”arXiv preprint arXiv:2409.12122, 2024

  50. [50]

    STaR: Bootstrapping reasoning with reasoning,

    E. Zelikman, Y . Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping reasoning with reasoning,”NeurIPS, vol. 35, pp. 15476–15488, 2022

  51. [51]

    rStar-Math: Small LLMs can mas- ter math reasoning with self-evolved deep thinking,

    X. Guan, L. Li, Y . Chen, Z. Shao, Z. Ren, B. Liu, D. Dai, J. Shao, J. Song, Y . Wu, and Q. Zhu, “rStar-Math: Small LLMs can mas- ter math reasoning with self-evolved deep thinking,”arXiv preprint arXiv:2404.07846, 2025

  52. [52]

    LIMO: Less is more for reasoning,

    Y . Ye, Z. Huang, Y . Xiao, E. Chern, S. Xia, and P. Liu, “LIMO: Less is more for reasoning,”arXiv preprint arXiv:2502.03387, 2025

  53. [53]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,”NeurIPS, vol. 35, pp. 24824–24837, 2022

  54. [54]

    Self-consistency improves chain of thought rea- soning in language models,

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd- hery, and D. Zhou, “Self-consistency improves chain of thought rea- soning in language models,” inICLR, 2023

  55. [55]

    Achieving>97% on GSM8K: Deeply understanding the problems makes LLMs better solvers for math word problems,

    Q. Zhong, K. Wang, Z. Xu, J. Liu, L. Ding, B. Du, and D. Tao, “Achieving>97% on GSM8K: Deeply understanding the problems makes LLMs better solvers for math word problems,”arXiv preprint arXiv:2404.14963, 2024

  56. [56]

    Tree of thoughts: Deliberate problem solving with large language models,

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y . Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,”NeurIPS, vol. 36, 2023

  57. [57]

    Least-to-most prompting enables complex reasoning in large language models,

    D. Zhou, N. Sch ¨arli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schu- urmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Least-to-most prompting enables complex reasoning in large language models,” in ICLR, 2023

  58. [58]

    Graph of thoughts: Solving elaborate problems with large language models,

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al., “Graph of thoughts: Solving elaborate problems with large language models,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, pp. 17682–17690, 2024

  59. [59]

    DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi,et al., “DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  60. [60]

    Kimi k1.5: Scaling reinforcement learning with LLMs,

    Kimi Team, “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv:2501.12599, 2025

  61. [61]

    s1: Simple test-time scaling,

    N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Cand `es, and T. Hashimoto, “s1: Simple test-time scaling,”arXiv preprint arXiv:2501.19393, 2025

  62. [62]

    Improv- ing factuality and reasoning in language models through multiagent debate,

    Y . Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improv- ing factuality and reasoning in language models through multiagent debate,”arXiv preprint arXiv:2305.14325, 2023

  63. [63]

    ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs,

    J. C.-Y . Chen, S. Saha, and M. Bansal, “ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs,” inACL, 2024

  64. [64]

    MAgICoRe: Multi-agent, iterative, coarse-to-fine refinement for rea- soning,

    J. C.-Y . Chen, A. Prasad, S. Saha, E. Stengel-Eskin, and M. Bansal, “MAgICoRe: Multi-agent, iterative, coarse-to-fine refinement for rea- soning,” inEMNLP, 2025

  65. [65]

    Graph-of-agents: A graph-based framework for multi-agent LLM collaboration,

    S. Yun, J. Peng, P. Li, W. Fan, J. Chen, J. Zou, G. Li, and T. Chen, “Graph-of-agents: A graph-based framework for multi-agent LLM collaboration,” inICLR, 2026

  66. [66]

    AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning,

    W. Zhaoet al., “AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning,”arXiv preprint arXiv:2505.23381, 2025

  67. [67]

    FGeo- HyperGNet: Geometric problem solving integrating FormalGeo sym- bolic system and hypergraph neural network,

    X. Zhang, Y . Li, N. Zhu, C. Qin, Z. Zeng, and T. Leng, “FGeo- HyperGNet: Geometric problem solving integrating FormalGeo sym- bolic system and hypergraph neural network,” inProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pp. 4733–4741, 2025

  68. [68]

    Solving olympiad geometry without human demonstrations,

    T. H. Trinh, Y . Wu, Q. V . Le, H. He, and T. Luong, “Solving olympiad geometry without human demonstrations,”Nature, vol. 625, pp. 476– 482, 2024

  69. [69]

    Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,

    Y . Chervonyi, T. H. Trinh, M. Ol ˇs´ak, X. Yang, H. Nguyen, M. Mene- gali, J. Jung, V . Verma, Q. V . Le, and T. Luong, “Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,” arXiv preprint arXiv:2502.03544, 2025

  70. [70]

    MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,

    P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,” inICLR, 2024

  71. [71]

    MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?,

    R. Zhang, D. Jiang, Y . Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y . Qiao, and H. Li, “MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?,” inECCV, 2024

  72. [72]

    Measuring multimodal mathematical reasoning with MATH-vision dataset,

    K. Wang, J. Pan, W. Shi, Z. Lu, M. Zhan, and H. Li, “Measuring multimodal mathematical reasoning with MATH-vision dataset,” in NeurIPS Datasets and Benchmarks Track, 2024

  73. [73]

    MA VIS: Mathematical visual instruction tuning,

    R. Zhang, X. Wei, D. Jiang, Y . Zhang, Z. Guo, C. Tong, J. Liu, A. Zhou, B. Wei, S. Zhang,et al., “MA VIS: Mathematical visual instruction tuning,”arXiv preprint arXiv:2407.08739, 2024

  74. [74]

    MINT-CoT: Enabling interleaved visual tokens in mathematical chain- of-thought reasoning,

    X. Chen, R. Zhang, D. Jiang, A. Zhou, S. Yan, W. Lin, and H. Li, “MINT-CoT: Enabling interleaved visual tokens in mathematical chain- of-thought reasoning,”arXiv preprint arXiv:2506.05331, 2025

  75. [75]

    Generative language modeling for automated theorem proving,

    S. Polu and I. Sutskever, “Generative language modeling for automated theorem proving,”arXiv preprint arXiv:2009.03393, 2020

  76. [76]

    Lean Copilot: Large language models as copilots for theorem proving in Lean,

    P. Song, K. Yang, and A. Anandkumar, “Lean Copilot: Large language models as copilots for theorem proving in Lean,”arXiv preprint arXiv:2404.12534, 2024

  77. [77]

    APOLLO: Automated LLM and Lean collaboration for proof repair via compiler feedback,

    Z. Sunet al., “APOLLO: Automated LLM and Lean collaboration for proof repair via compiler feedback,”arXiv preprint arXiv:2505.05758, 2025

  78. [78]

    Baldur: Whole-proof generation and repair with large language models,

    E. First, M. N. Rabe, T. Ringer, and Y . Brun, “Baldur: Whole-proof generation and repair with large language models,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, (New York, NY , USA), pp. 1229–1241, Association for Computing Machinery, 2023

  79. [79]

    DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition,

    Z. Ren, Z. Shao, J. Song, H. Xin, H. Wang, W. Zhao, L. Zhang, Z. Fu, Q. Zhu, D. Yang,et al., “DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition,”arXiv preprint arXiv:2504.21801, 2025

  80. [80]

    Kimina-Prover Preview: Towards large for- mal reasoning models with reinforcement learning,

    H. Wanget al., “Kimina-Prover Preview: Towards large for- mal reasoning models with reinforcement learning,”arXiv preprint arXiv:2504.11354, 2025

Showing first 80 references.