Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

Hasan Mahmud; Md Kamrul Hasan; Mohsinul Kabir; Syed Rifat Raiyan

arxiv: 2606.08728 · v1 · pith:3USIPORNnew · submitted 2026-06-07 · 💻 cs.AI · cs.CL· cs.CV· cs.LG

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

Syed Rifat Raiyan , Mohsinul Kabir , Hasan Mahmud , Md Kamrul Hasan This is my paper

Pith reviewed 2026-06-27 18:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CVcs.LG

keywords mathematical reasoninglanguage modelsneuro-symbolic systemstheorem provingAI benchmarksfailure modesverified discovery

0 comments

The pith

AI mathematical reasoning is organized into four axes: informal solving over text and diagrams, formal proving in assistants, mathematical discovery, and the inference and training techniques that link them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The survey establishes a unified account of how AI systems approach mathematical reasoning by tracing the field's development from early rule-based methods to current large language models, neuro-symbolic provers, and verified discovery pipelines. It divides the landscape into informal reasoning over text and diagrams, formal reasoning inside proof assistants, systems that propose new constructions or attack open problems, and the prompting, tool-use, and reinforcement techniques that connect generation to verification. The account also catalogs benchmarks across arithmetic, competition problems, geometry, and formal proving while examining issues like saturation, contamination, and the gap between pass-at-1 and verifier-assisted results. A sympathetic reader would care because mathematical reasoning remains one of the clearest tests of whether AI systems can move beyond pattern matching to reliable, verifiable inference.

Core claim

This survey organizes the AI mathematical reasoning landscape along four axes: informal reasoning over text and diagrams that includes math word problems, multimodal geometry, and vision-language models; formal reasoning inside proof assistants covering autoformalization, tactic prediction, and proof search; mathematical discovery where systems propose constructions or improve bounds on open problems; and inference and training techniques such as chain-of-thought prompting, tool use, process reward models, and reinforcement learning with verification that increasingly tie generation to checking.

What carries the argument

The four-axis taxonomy that partitions the field into informal reasoning, formal reasoning in proof assistants, mathematical discovery, and inference/training techniques while also cataloging benchmarks and failure modes.

If this is right

Progress can be tracked separately along each of the four axes while noting where techniques from one axis improve another.
Benchmark issues such as contamination and the difference between pass@1 and verifier-assisted scoring affect how results are interpreted across informal and formal settings.
Failure modes like brittleness under perturbation and reward hacking appear in both language-model and neuro-symbolic systems.
Future systems are expected to combine generation with verification in verified-discovery workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could serve as a checklist when designing new hybrid systems that move fluidly between informal exploration and formal checking.
Attention to energy cost and inference efficiency may become a fifth practical axis as model scale grows.
Making formalization tools usable by working mathematicians would test whether the survey's identified directions actually accelerate open-problem work.

Load-bearing premise

That the chosen papers, benchmarks, and listed failure modes give a representative picture of a fast-moving field without major gaps or biases in coverage.

What would settle it

A later review that identifies a large body of work falling outside the four axes or that documents dominant failure modes not listed in the survey would show the organization is incomplete.

Figures

Figures reproduced from arXiv: 2606.08728 by Hasan Mahmud, Md Kamrul Hasan, Mohsinul Kabir, Syed Rifat Raiyan.

**Figure 3.** Figure 3: Search, screening, and taxonomy coding pipeline for the survey corpus. The left side shows the record-selection flow [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: A taxonomy of mathematical reasoning systems. The four main axes (informal text-only, multimodal, formal, discovery) [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Chronology of AI systems for mathematical reasoning, organized as paradigm swimlanes. Early symbolic and statistical [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 7.** Figure 7: Equation Tree representing (x + n1) ÷ n2 = n3. Seq2Seq methods: [41] introduced DNS, an RNN-based SEQ2SEQ with a similarity-based retrieval component. The encoder uses GRUs [119], the decoder uses LSTM [120], and a Significant Number Identification (SNI) model identifies relevant numbers in the problem text. It released the landmark MATH23K corpus of 23,161 Chinese MWPs. EQUGENER [121] employs a memory-ne… view at source ↗

**Figure 8.** Figure 8: A worked-example comparison of SEQ2TREE (top) and GRAPH2TREE (bottom) on the same problem; yellowhighlighted tokens are extracted quantities. SEQ2TREE must infer from the LSTM’s hidden state that the two numbers share the unit “marbles”; GRAPH2TREE receives this fact explicitly through a quantity-relation graph encoding “5 same-unit 3”, so the GNN’s hidden state separates which quantities can combine arit… view at source ↗

**Figure 9.** Figure 9: The comprehension–generation–verification triad. Solid [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: Evolution of prompting topologies for mathematical reasoning. The upper timeline sketches the shift from direct [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Four multi-agent topologies for mathematical reasoning. (a) Debate: fully connected rounds with a judge [ [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Evolution of RL algorithms for reasoning-model training. The main track (top) shows the progression from critic-based [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: The verified-discovery pipeline. Top row: four abstract stages, neural proposer (cyan), informal reasoning (yellow), formalization (pink), and verification (purple), each with its representative artifacts. Bottom row: a concrete instantiation by the GPT-5.2 Pro + ARISTOTLE team that produced a Lean-checked proof of Erdos problem #729 in early 2026, with one panel per ˝ stage showing the actual artifact it… view at source ↗

**Figure 14.** Figure 14: Benchmark saturation, 2021–2026: best-reported single-model performance on nine canonical mathematical-reasoning [PITH_FULL_IMAGE:figures/full_fig_p034_14.png] view at source ↗

read the original abstract

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A useful but standard survey that organizes the AI-for-math field into four axes and flags failure modes without adding new methods or results.

read the letter

This paper is a literature survey that splits AI mathematical reasoning into four axes: informal text/diagram reasoning, formal proof-assistant work, mathematical discovery, and the inference/training tricks that link them. It also lists benchmarks and notes common failure modes like brittleness and reward hacking.

What it does well is give a clean map of how the field moved from early rule-based solvers to current neuro-symbolic and LLM-based systems. The benchmark catalog and the short discussion of pass@k versus verifier-assisted metrics are practical for anyone trying to compare papers. The github link for companion materials is a reasonable addition.

The soft spots are the usual ones for surveys in a fast-moving area. Coverage depends on the authors' selection, and the critical comments on failure modes stay at a high level without new analysis or deeper checks on the cited works. No technical gaps are closed and no fresh data or derivations appear.

This is mainly for newcomers or researchers who want a quick orientation before diving into specific papers. It does not resolve open questions but could help people avoid reinventing the same taxonomies.

I would send it to peer review. The organization is clear enough that referees can check the synthesis and suggest missing references.

Referee Report

1 major / 2 minor

Summary. The manuscript is a literature survey that organizes AI approaches to mathematical reasoning into four axes—informal reasoning over text and diagrams, formal reasoning in proof assistants, mathematical discovery, and inference/training techniques—while cataloging benchmarks across arithmetic, competition math, geometry, and formal proving, and critically examining failure modes such as benchmark contamination, reward hacking, and multimodal grounding failures. It concludes with future directions emphasizing verified-discovery workflows and efficiency improvements, supported by a companion GitHub repository.

Significance. If the synthesis is representative, the four-axis taxonomy and benchmark catalog could provide a useful organizing framework for a fast-moving area, with the explicit discussion of pass@k distinctions and energy costs of inference adding practical value. The companion GitHub repository (awesome-AI4Math) is a concrete strength that supports reproducibility and ongoing updates.

major comments (1)

[Abstract and §1] Abstract and §1 (Introduction): the central claim that the survey constitutes a 'unified account' and 'critically assesses' failure modes rests on the assumption of representative literature selection, yet no explicit inclusion criteria, coverage metrics, or search methodology are stated; this directly affects the weakest assumption identified in the review and requires a dedicated subsection to make the representativeness claim verifiable.

minor comments (2)

The abstract lists 'multimodal and multilingual reasoning' benchmarks but the corresponding catalog section should include a table or explicit enumeration to improve scannability.
Future-directions paragraph would benefit from concrete, falsifiable milestones (e.g., specific open problems or benchmark thresholds) rather than high-level aspirations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. The single major comment identifies a genuine gap in methodological transparency that we can readily address without altering the core contributions.

read point-by-point responses

Referee: [Abstract and §1] Abstract and §1 (Introduction): the central claim that the survey constitutes a 'unified account' and 'critically assesses' failure modes rests on the assumption of representative literature selection, yet no explicit inclusion criteria, coverage metrics, or search methodology are stated; this directly affects the weakest assumption identified in the review and requires a dedicated subsection to make the representativeness claim verifiable.

Authors: We agree that the absence of an explicit literature-selection methodology weakens the verifiability of the 'unified account' and 'critically assesses' claims. In the revised manuscript we will insert a new subsection (placed immediately after the current §1.1) titled 'Literature Selection and Scope'. This subsection will detail: (i) search strategy (keywords, databases, and venues queried, including arXiv categories cs.AI/cs.LG, top conferences 2018–2024, and citation thresholds), (ii) inclusion/exclusion criteria (peer-reviewed or high-impact preprints focused on AI-driven mathematical reasoning, with explicit cut-offs for coverage of each axis), and (iii) quantitative coverage metrics (number of papers per axis and benchmark category). The GitHub repository will also be updated with the same selection log. These additions directly satisfy the request while preserving the existing narrative. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a literature survey paper with no derivations, equations, fitted parameters, predictions, or theoretical claims. The central contribution is an organizational taxonomy across four axes plus benchmark cataloging and failure-mode discussion; none of these reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The paper is self-contained as a review and draws on external literature without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work introduces no free parameters, axioms, or invented entities; it aggregates and organizes existing published results.

pith-pipeline@v0.9.1-grok · 5870 in / 920 out tokens · 19864 ms · 2026-06-27T18:38:20.763854+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

266 extracted references · 55 linked inside Pith

[1]

E. A. Feigenbaum, J. Feldman,et al.,Computers and thought. New York: McGraw-Hill, 1963

1963
[2]

Natural language input for a computer problem solving system,

D. G. Bobrow, “Natural language input for a computer problem solving system,” 1964

1964
[3]

Neu- ropsychological performance, iq, personality, and grades in a longi- tudinal grade-school male sample,

J. Peterson, R. Pihl, D. Higgins, J. S ´eguin, and R. Tremblay, “Neu- ropsychological performance, iq, personality, and grades in a longi- tudinal grade-school male sample,”Individual Differences Research, vol. 1, pp. 159–172, 2003

2003
[4]

A broad look at the literature on math word problem-solving interventions for third graders,

S. Kingsdorf and J. Krawec, “A broad look at the literature on math word problem-solving interventions for third graders,”Cogent Education, vol. 3, no. 1, p. 1135770, 2016

2016
[5]

Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning,

P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S.-C. Zhu, “Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning,”arXiv preprint arXiv:2105.04165, 2021

arXiv 2021
[6]

The Lean 4 theorem prover and program- ming language,

L. de Moura and S. Ullrich, “The Lean 4 theorem prover and program- ming language,”CADE, 2021

2021
[7]

Mathematical discoveries from program search with large language models,

B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi,et al., “Mathematical discoveries from program search with large language models,”Nature, vol. 625, pp. 468–475, 2024

2024
[8]

AlphaEvolve: A cod- ing agent for scientific and algorithmic discovery,

A. Novikov, N. V ˜u, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii,et al., “AlphaEvolve: A cod- ing agent for scientific and algorithmic discovery,”arXiv preprint arXiv:2506.13131, 2025

Pith/arXiv arXiv 2025
[9]

Autoformalization with large language models,

Y . Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy, “Autoformalization with large language models,”NeurIPS, vol. 35, pp. 32353–32368, 2022

2022
[10]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,

A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y . Wu, and G. Lample, “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,”ICLR, 2023

2023
[11]

Erd\h{o}s’ minimum overlap problem,

E. P. White, “Erd\h{o}s’ minimum overlap problem,”arXiv preprint arXiv:2201.05704, 2022

arXiv 2022
[12]

Are nlp models really able to solve simple math word problems?,

A. Patel, S. Bhattamishra, and N. Goyal, “Are nlp models really able to solve simple math word problems?,”arXiv preprint arXiv:2103.07191, 2021

Pith/arXiv arXiv 2021
[13]

GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,

I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,”arXiv preprint arXiv:2410.05229, 2024

Pith/arXiv arXiv 2024
[14]

LeanDojo: Theorem prov- ing with retrieval-augmented language models,

K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. Prenger, and A. Anandkumar, “LeanDojo: Theorem prov- ing with retrieval-augmented language models,” inNeurIPS Datasets and Benchmarks Track, 2023

2023
[15]

A review of methods for automatic understanding of natural language mathematical problems,

A. Mukherjee and U. Garain, “A review of methods for automatic understanding of natural language mathematical problems,”Artificial Intelligence Review, vol. 29, no. 2, pp. 93–122, 2008

2008
[16]

Solving arithmetic mathematical word problems: A review and recent advancements,

S. Mandal and S. K. Naskar, “Solving arithmetic mathematical word problems: A review and recent advancements,” inInformation Tech- nology and Applied Mathematics(P. Chandra, D. Giri, F. Li, S. Kar, and D. K. Jana, eds.), (Singapore), pp. 95–114, Springer Singapore, 2019

2019
[17]

The gap of semantic parsing: A survey on automatic math word problem solvers,

D. Zhang, L. Wang, L. Zhang, B. T. Dai, and H. T. Shen, “The gap of semantic parsing: A survey on automatic math word problem solvers,” 2018

2018
[18]

Why are nlp models fumbling at elementary math? a survey of deep learning based word problem solvers,

S. S. Sundaram, S. Gurajada, M. Fisichella, S. S. Abraham,et al., “Why are nlp models fumbling at elementary math? a survey of deep learning based word problem solvers,”arXiv preprint arXiv:2205.15683, 2022

arXiv 2022
[19]

A survey of deep learning for mathematical reasoning,

P. Lu, L. Qiu, W. Yu, S. Welleck, and K.-W. Chang, “A survey of deep learning for mathematical reasoning,”ACL, 2023

2023
[20]

Large lan- guage models for mathematical reasoning: Progresses and challenges,

J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin, “Large lan- guage models for mathematical reasoning: Progresses and challenges,” inEACL Student Research Workshop, 2024

2024
[21]

Learning to reason with LLMs,

OpenAI, “Learning to reason with LLMs,” tech. rep., OpenAI, 2024. https://openai.com/index/learning-to-reason-with-llms/

2024
[22]

Mathematical language models: A survey,

W. Liu, H. Hu, J. Zhou, Y . Ding, J. Li, J. Zeng, M. He, Q. Chen, B. Jiang, A. Zhou, and L. He, “Mathematical language models: A survey,”ACM Computing Surveys, vol. 58, no. 6, 2025

2025
[23]

A survey on large language models for mathematical reasoning,

P.-Y . Wang, T.-S. Liu, C. Wang, Z. Li, Y . Wang, S. Yan, C. Jia, X.-H. Liu, X. Chen, J. Xu, and Y . Yu, “A survey on large language models for mathematical reasoning,”ACM Computing Surveys, vol. 58, no. 8, 2026

2026
[24]

A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,

Y . Yan, J. Su, J. He, F. Fu, X. Zheng, Y . Lyu, K. Wang, S. Wang, Q. Wen, and X. Hu, “A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,” inFindings of the Association for Computational Linguistics: ACL 2025, pp. 11798–11827, 2025

2025
[25]

Large language model based multi-agents: A survey of progress and challenges,

T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,”arXiv preprint arXiv:2402.01680, 2024

Pith/arXiv arXiv 2024
[26]

Formal mathematical reasoning: A new frontier in AI,

K. Yang, G. Poesia, J. He, W. Li, K. Lauter, S. Chaudhuri, and D. Song, “Formal mathematical reasoning: A new frontier in AI,”arXiv preprint arXiv:2412.16075, 2024

arXiv 2024
[27]

A survey on deep learning for theorem proving,

Z. Li, J. Sun, L. Murphy, Q. Su, Z. Li, X. Zhang, K. Yang, and X. Si, “A survey on deep learning for theorem proving,”Conference on Language Modeling (COLM), 2024

2024
[28]

AI will become mathematicians’ co-pilot

T. Tao, “AI will become mathematicians’ co-pilot.” Scientific American Interview, 2024

2024
[29]

AI is ready for primetime in math and theoretical physics

T. Tao, “AI is ready for primetime in math and theoretical physics.” IPAM talk, reported by OpenAI Academy, 2026

2026
[30]

Mathematical methods and human thought in the age of AI,

T. Klowden and T. Tao, “Mathematical methods and human thought in the age of AI,”arXiv preprint, 2026. To appear in Blackwell Companion to the Philosophy of Mathematics

2026
[31]

Guidelines for performing systematic literature reviews in software engineering,

B. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering,” Tech. Rep. EBSE-2007-01, Keele University and Durham University, 2007

2007
[32]

PAL: Program-aided language models,

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “PAL: Program-aided language models,” inICML, 2023

2023
[33]

Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,

W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,”TMLR, 2023

2023
[34]

ToRA: A tool-integrated reasoning agent for mathematical problem solving,

Z. Gou, Z. Shao, Y . Gong, Y . Shen, Y . Yang, M. Huang, N. Duan, and W. Chen, “ToRA: A tool-integrated reasoning agent for mathematical problem solving,” inICLR, 2024

2024
[35]

GeoQA: A geometric question answering benchmark towards multi- modal numerical reasoning,

J. Chen, J. Tang, J. Qin, X. Liang, L. Liu, E. P. Xing, and L. Lin, “GeoQA: A geometric question answering benchmark towards multi- modal numerical reasoning,”arXiv preprint arXiv:2105.14517, 2021

arXiv 2021
[36]

G-LLaV A: Solving geometric problem with multi-modal large language model,

J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y . Wang, L. Hong, J. Han, H. Xu, Z. Li, and L. Kong, “G-LLaV A: Solving geometric problem with multi-modal large language model,”arXiv preprint arXiv:2312.11370, 2023

arXiv 2023
[37]

DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,

H. Xin, D. Guo, Z. Shao, Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang, “DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,”arXiv preprint arXiv:2405.14333, 2024

arXiv 2024
[38]

Understanding and solving arithmetic word problems: A computer simulation,

C. R. Fletcher, “Understanding and solving arithmetic word problems: A computer simulation,”Behavior Research Methods, Instruments, & Computers, vol. 17, no. 5, pp. 565–571, 1985

1985
[39]

Learning to solve arithmetic word problems with verb categorization,

M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman, “Learning to solve arithmetic word problems with verb categorization,” inEMNLP, 2014

2014
[40]

Learning to automatically solve algebra word problems,

N. Kushman, Y . Artzi, L. Zettlemoyer, and R. Barzilay, “Learning to automatically solve algebra word problems,” inProceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 271–281, 2014

2014
[41]

Deep neural solver for math word problems,

Y . Wang, X. Liu, and S. Shi, “Deep neural solver for math word problems,” inEMNLP, pp. 845–854, 2017

2017
[42]

A goal-driven tree-structured neural model for math word problems,

Z. Xie and S. Sun, “A goal-driven tree-structured neural model for math word problems,” inIJCAI, pp. 5299–5305, 2019

2019
[43]

Graph-to-tree learning for solving math word problems,

J. Zhang, L. Wang, R. K.-W. Lee, Y . Bin, Y . Wang, J. Shao, and E.- P. Lim, “Graph-to-tree learning for solving math word problems,” in ACL, 2020

2020
[44]

Generate & rank: A multi-task framework for math word problems,

J. Shen, Y . Yin, L. Li, L. Shang, X. Jiang, M. Zhang, and Q. Liu, “Generate & rank: A multi-task framework for math word problems,” arXiv preprint arXiv:2109.03034, 2021

arXiv 2021
[45]

MWP-BERT: A strong baseline for math word problems,

Z. Liang, J. Zhang, J. Shao, and X. Zhang, “MWP-BERT: A strong baseline for math word problems,”arXiv preprint arXiv:2107.13435, 2021

arXiv 2021
[46]

Learning to reason deductively: Math word problem solving as complex relation extraction,

Z. Jie, J. Li, and W. Lu, “Learning to reason deductively: Math word problem solving as complex relation extraction,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5944–5955, 2022

2022
[47]

Solving quantitative reasoning problems with language models,

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo,et al., “Solving quantitative reasoning problems with language models,” NeurIPS, vol. 35, pp. 3843–3857, 2022

2022
[48]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu,et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[49]

Qwen2.5-math technical report: Toward mathematical expert model via self-improvement,

A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin,et al., “Qwen2.5-math technical report: Toward mathematical expert model via self-improvement,”arXiv preprint arXiv:2409.12122, 2024

Pith/arXiv arXiv 2024
[50]

STaR: Bootstrapping reasoning with reasoning,

E. Zelikman, Y . Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping reasoning with reasoning,”NeurIPS, vol. 35, pp. 15476–15488, 2022

2022
[51]

rStar-Math: Small LLMs can mas- ter math reasoning with self-evolved deep thinking,

X. Guan, L. Li, Y . Chen, Z. Shao, Z. Ren, B. Liu, D. Dai, J. Shao, J. Song, Y . Wu, and Q. Zhu, “rStar-Math: Small LLMs can mas- ter math reasoning with self-evolved deep thinking,”arXiv preprint arXiv:2404.07846, 2025

arXiv 2025
[52]

LIMO: Less is more for reasoning,

Y . Ye, Z. Huang, Y . Xiao, E. Chern, S. Xia, and P. Liu, “LIMO: Less is more for reasoning,”arXiv preprint arXiv:2502.03387, 2025

Pith/arXiv arXiv 2025
[53]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,”NeurIPS, vol. 35, pp. 24824–24837, 2022

2022
[54]

Self-consistency improves chain of thought rea- soning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd- hery, and D. Zhou, “Self-consistency improves chain of thought rea- soning in language models,” inICLR, 2023

2023
[55]

Achieving>97% on GSM8K: Deeply understanding the problems makes LLMs better solvers for math word problems,

Q. Zhong, K. Wang, Z. Xu, J. Liu, L. Ding, B. Du, and D. Tao, “Achieving>97% on GSM8K: Deeply understanding the problems makes LLMs better solvers for math word problems,”arXiv preprint arXiv:2404.14963, 2024

arXiv 2024
[56]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y . Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,”NeurIPS, vol. 36, 2023

2023
[57]

Least-to-most prompting enables complex reasoning in large language models,

D. Zhou, N. Sch ¨arli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schu- urmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Least-to-most prompting enables complex reasoning in large language models,” in ICLR, 2023

2023
[58]

Graph of thoughts: Solving elaborate problems with large language models,

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al., “Graph of thoughts: Solving elaborate problems with large language models,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, pp. 17682–17690, 2024

2024
[59]

DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning,

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi,et al., “DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025
[60]

Kimi k1.5: Scaling reinforcement learning with LLMs,

Kimi Team, “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv:2501.12599, 2025

Pith/arXiv arXiv 2025
[61]

s1: Simple test-time scaling,

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Cand `es, and T. Hashimoto, “s1: Simple test-time scaling,”arXiv preprint arXiv:2501.19393, 2025

Pith/arXiv arXiv 2025
[62]

Improv- ing factuality and reasoning in language models through multiagent debate,

Y . Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improv- ing factuality and reasoning in language models through multiagent debate,”arXiv preprint arXiv:2305.14325, 2023

Pith/arXiv arXiv 2023
[63]

ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs,

J. C.-Y . Chen, S. Saha, and M. Bansal, “ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs,” inACL, 2024

2024
[64]

MAgICoRe: Multi-agent, iterative, coarse-to-fine refinement for rea- soning,

J. C.-Y . Chen, A. Prasad, S. Saha, E. Stengel-Eskin, and M. Bansal, “MAgICoRe: Multi-agent, iterative, coarse-to-fine refinement for rea- soning,” inEMNLP, 2025

2025
[65]

Graph-of-agents: A graph-based framework for multi-agent LLM collaboration,

S. Yun, J. Peng, P. Li, W. Fan, J. Chen, J. Zou, G. Li, and T. Chen, “Graph-of-agents: A graph-based framework for multi-agent LLM collaboration,” inICLR, 2026

2026
[66]

AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning,

W. Zhaoet al., “AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning,”arXiv preprint arXiv:2505.23381, 2025

arXiv 2025
[67]

FGeo- HyperGNet: Geometric problem solving integrating FormalGeo sym- bolic system and hypergraph neural network,

X. Zhang, Y . Li, N. Zhu, C. Qin, Z. Zeng, and T. Leng, “FGeo- HyperGNet: Geometric problem solving integrating FormalGeo sym- bolic system and hypergraph neural network,” inProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pp. 4733–4741, 2025

2025
[68]

Solving olympiad geometry without human demonstrations,

T. H. Trinh, Y . Wu, Q. V . Le, H. He, and T. Luong, “Solving olympiad geometry without human demonstrations,”Nature, vol. 625, pp. 476– 482, 2024

2024
[69]

Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,

Y . Chervonyi, T. H. Trinh, M. Ol ˇs´ak, X. Yang, H. Nguyen, M. Mene- gali, J. Jung, V . Verma, Q. V . Le, and T. Luong, “Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,” arXiv preprint arXiv:2502.03544, 2025

arXiv 2025
[70]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,” inICLR, 2024

2024
[71]

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?,

R. Zhang, D. Jiang, Y . Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y . Qiao, and H. Li, “MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?,” inECCV, 2024

2024
[72]

Measuring multimodal mathematical reasoning with MATH-vision dataset,

K. Wang, J. Pan, W. Shi, Z. Lu, M. Zhan, and H. Li, “Measuring multimodal mathematical reasoning with MATH-vision dataset,” in NeurIPS Datasets and Benchmarks Track, 2024

2024
[73]

MA VIS: Mathematical visual instruction tuning,

R. Zhang, X. Wei, D. Jiang, Y . Zhang, Z. Guo, C. Tong, J. Liu, A. Zhou, B. Wei, S. Zhang,et al., “MA VIS: Mathematical visual instruction tuning,”arXiv preprint arXiv:2407.08739, 2024

arXiv 2024
[74]

MINT-CoT: Enabling interleaved visual tokens in mathematical chain- of-thought reasoning,

X. Chen, R. Zhang, D. Jiang, A. Zhou, S. Yan, W. Lin, and H. Li, “MINT-CoT: Enabling interleaved visual tokens in mathematical chain- of-thought reasoning,”arXiv preprint arXiv:2506.05331, 2025

arXiv 2025
[75]

Generative language modeling for automated theorem proving,

S. Polu and I. Sutskever, “Generative language modeling for automated theorem proving,”arXiv preprint arXiv:2009.03393, 2020

Pith/arXiv arXiv 2009
[76]

Lean Copilot: Large language models as copilots for theorem proving in Lean,

P. Song, K. Yang, and A. Anandkumar, “Lean Copilot: Large language models as copilots for theorem proving in Lean,”arXiv preprint arXiv:2404.12534, 2024

arXiv 2024
[77]

APOLLO: Automated LLM and Lean collaboration for proof repair via compiler feedback,

Z. Sunet al., “APOLLO: Automated LLM and Lean collaboration for proof repair via compiler feedback,”arXiv preprint arXiv:2505.05758, 2025

arXiv 2025
[78]

Baldur: Whole-proof generation and repair with large language models,

E. First, M. N. Rabe, T. Ringer, and Y . Brun, “Baldur: Whole-proof generation and repair with large language models,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, (New York, NY , USA), pp. 1229–1241, Association for Computing Machinery, 2023

2023
[79]

DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition,

Z. Ren, Z. Shao, J. Song, H. Xin, H. Wang, W. Zhao, L. Zhang, Z. Fu, Q. Zhu, D. Yang,et al., “DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition,”arXiv preprint arXiv:2504.21801, 2025

Pith/arXiv arXiv 2025
[80]

Kimina-Prover Preview: Towards large for- mal reasoning models with reinforcement learning,

H. Wanget al., “Kimina-Prover Preview: Towards large for- mal reasoning models with reinforcement learning,”arXiv preprint arXiv:2504.11354, 2025

Pith/arXiv arXiv 2025

Showing first 80 references.

[1] [1]

E. A. Feigenbaum, J. Feldman,et al.,Computers and thought. New York: McGraw-Hill, 1963

1963

[2] [2]

Natural language input for a computer problem solving system,

D. G. Bobrow, “Natural language input for a computer problem solving system,” 1964

1964

[3] [3]

Neu- ropsychological performance, iq, personality, and grades in a longi- tudinal grade-school male sample,

J. Peterson, R. Pihl, D. Higgins, J. S ´eguin, and R. Tremblay, “Neu- ropsychological performance, iq, personality, and grades in a longi- tudinal grade-school male sample,”Individual Differences Research, vol. 1, pp. 159–172, 2003

2003

[4] [4]

A broad look at the literature on math word problem-solving interventions for third graders,

S. Kingsdorf and J. Krawec, “A broad look at the literature on math word problem-solving interventions for third graders,”Cogent Education, vol. 3, no. 1, p. 1135770, 2016

2016

[5] [5]

Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning,

P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S.-C. Zhu, “Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning,”arXiv preprint arXiv:2105.04165, 2021

arXiv 2021

[6] [6]

The Lean 4 theorem prover and program- ming language,

L. de Moura and S. Ullrich, “The Lean 4 theorem prover and program- ming language,”CADE, 2021

2021

[7] [7]

Mathematical discoveries from program search with large language models,

B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi,et al., “Mathematical discoveries from program search with large language models,”Nature, vol. 625, pp. 468–475, 2024

2024

[8] [8]

AlphaEvolve: A cod- ing agent for scientific and algorithmic discovery,

A. Novikov, N. V ˜u, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii,et al., “AlphaEvolve: A cod- ing agent for scientific and algorithmic discovery,”arXiv preprint arXiv:2506.13131, 2025

Pith/arXiv arXiv 2025

[9] [9]

Autoformalization with large language models,

Y . Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy, “Autoformalization with large language models,”NeurIPS, vol. 35, pp. 32353–32368, 2022

2022

[10] [10]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,

A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y . Wu, and G. Lample, “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,”ICLR, 2023

2023

[11] [11]

Erd\h{o}s’ minimum overlap problem,

E. P. White, “Erd\h{o}s’ minimum overlap problem,”arXiv preprint arXiv:2201.05704, 2022

arXiv 2022

[12] [12]

Are nlp models really able to solve simple math word problems?,

A. Patel, S. Bhattamishra, and N. Goyal, “Are nlp models really able to solve simple math word problems?,”arXiv preprint arXiv:2103.07191, 2021

Pith/arXiv arXiv 2021

[13] [13]

GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,

I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,”arXiv preprint arXiv:2410.05229, 2024

Pith/arXiv arXiv 2024

[14] [14]

LeanDojo: Theorem prov- ing with retrieval-augmented language models,

K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. Prenger, and A. Anandkumar, “LeanDojo: Theorem prov- ing with retrieval-augmented language models,” inNeurIPS Datasets and Benchmarks Track, 2023

2023

[15] [15]

A review of methods for automatic understanding of natural language mathematical problems,

A. Mukherjee and U. Garain, “A review of methods for automatic understanding of natural language mathematical problems,”Artificial Intelligence Review, vol. 29, no. 2, pp. 93–122, 2008

2008

[16] [16]

Solving arithmetic mathematical word problems: A review and recent advancements,

S. Mandal and S. K. Naskar, “Solving arithmetic mathematical word problems: A review and recent advancements,” inInformation Tech- nology and Applied Mathematics(P. Chandra, D. Giri, F. Li, S. Kar, and D. K. Jana, eds.), (Singapore), pp. 95–114, Springer Singapore, 2019

2019

[17] [17]

The gap of semantic parsing: A survey on automatic math word problem solvers,

D. Zhang, L. Wang, L. Zhang, B. T. Dai, and H. T. Shen, “The gap of semantic parsing: A survey on automatic math word problem solvers,” 2018

2018

[18] [18]

Why are nlp models fumbling at elementary math? a survey of deep learning based word problem solvers,

S. S. Sundaram, S. Gurajada, M. Fisichella, S. S. Abraham,et al., “Why are nlp models fumbling at elementary math? a survey of deep learning based word problem solvers,”arXiv preprint arXiv:2205.15683, 2022

arXiv 2022

[19] [19]

A survey of deep learning for mathematical reasoning,

P. Lu, L. Qiu, W. Yu, S. Welleck, and K.-W. Chang, “A survey of deep learning for mathematical reasoning,”ACL, 2023

2023

[20] [20]

Large lan- guage models for mathematical reasoning: Progresses and challenges,

J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin, “Large lan- guage models for mathematical reasoning: Progresses and challenges,” inEACL Student Research Workshop, 2024

2024

[21] [21]

Learning to reason with LLMs,

OpenAI, “Learning to reason with LLMs,” tech. rep., OpenAI, 2024. https://openai.com/index/learning-to-reason-with-llms/

2024

[22] [22]

Mathematical language models: A survey,

W. Liu, H. Hu, J. Zhou, Y . Ding, J. Li, J. Zeng, M. He, Q. Chen, B. Jiang, A. Zhou, and L. He, “Mathematical language models: A survey,”ACM Computing Surveys, vol. 58, no. 6, 2025

2025

[23] [23]

A survey on large language models for mathematical reasoning,

P.-Y . Wang, T.-S. Liu, C. Wang, Z. Li, Y . Wang, S. Yan, C. Jia, X.-H. Liu, X. Chen, J. Xu, and Y . Yu, “A survey on large language models for mathematical reasoning,”ACM Computing Surveys, vol. 58, no. 8, 2026

2026

[24] [24]

A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,

Y . Yan, J. Su, J. He, F. Fu, X. Zheng, Y . Lyu, K. Wang, S. Wang, Q. Wen, and X. Hu, “A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,” inFindings of the Association for Computational Linguistics: ACL 2025, pp. 11798–11827, 2025

2025

[25] [25]

Large language model based multi-agents: A survey of progress and challenges,

T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V . Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,”arXiv preprint arXiv:2402.01680, 2024

Pith/arXiv arXiv 2024

[26] [26]

Formal mathematical reasoning: A new frontier in AI,

K. Yang, G. Poesia, J. He, W. Li, K. Lauter, S. Chaudhuri, and D. Song, “Formal mathematical reasoning: A new frontier in AI,”arXiv preprint arXiv:2412.16075, 2024

arXiv 2024

[27] [27]

A survey on deep learning for theorem proving,

Z. Li, J. Sun, L. Murphy, Q. Su, Z. Li, X. Zhang, K. Yang, and X. Si, “A survey on deep learning for theorem proving,”Conference on Language Modeling (COLM), 2024

2024

[28] [28]

AI will become mathematicians’ co-pilot

T. Tao, “AI will become mathematicians’ co-pilot.” Scientific American Interview, 2024

2024

[29] [29]

AI is ready for primetime in math and theoretical physics

T. Tao, “AI is ready for primetime in math and theoretical physics.” IPAM talk, reported by OpenAI Academy, 2026

2026

[30] [30]

Mathematical methods and human thought in the age of AI,

T. Klowden and T. Tao, “Mathematical methods and human thought in the age of AI,”arXiv preprint, 2026. To appear in Blackwell Companion to the Philosophy of Mathematics

2026

[31] [31]

Guidelines for performing systematic literature reviews in software engineering,

B. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering,” Tech. Rep. EBSE-2007-01, Keele University and Durham University, 2007

2007

[32] [32]

PAL: Program-aided language models,

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “PAL: Program-aided language models,” inICML, 2023

2023

[33] [33]

Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,

W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,”TMLR, 2023

2023

[34] [34]

ToRA: A tool-integrated reasoning agent for mathematical problem solving,

Z. Gou, Z. Shao, Y . Gong, Y . Shen, Y . Yang, M. Huang, N. Duan, and W. Chen, “ToRA: A tool-integrated reasoning agent for mathematical problem solving,” inICLR, 2024

2024

[35] [35]

GeoQA: A geometric question answering benchmark towards multi- modal numerical reasoning,

J. Chen, J. Tang, J. Qin, X. Liang, L. Liu, E. P. Xing, and L. Lin, “GeoQA: A geometric question answering benchmark towards multi- modal numerical reasoning,”arXiv preprint arXiv:2105.14517, 2021

arXiv 2021

[36] [36]

G-LLaV A: Solving geometric problem with multi-modal large language model,

J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y . Wang, L. Hong, J. Han, H. Xu, Z. Li, and L. Kong, “G-LLaV A: Solving geometric problem with multi-modal large language model,”arXiv preprint arXiv:2312.11370, 2023

arXiv 2023

[37] [37]

DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,

H. Xin, D. Guo, Z. Shao, Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang, “DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,”arXiv preprint arXiv:2405.14333, 2024

arXiv 2024

[38] [38]

Understanding and solving arithmetic word problems: A computer simulation,

C. R. Fletcher, “Understanding and solving arithmetic word problems: A computer simulation,”Behavior Research Methods, Instruments, & Computers, vol. 17, no. 5, pp. 565–571, 1985

1985

[39] [39]

Learning to solve arithmetic word problems with verb categorization,

M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman, “Learning to solve arithmetic word problems with verb categorization,” inEMNLP, 2014

2014

[40] [40]

Learning to automatically solve algebra word problems,

N. Kushman, Y . Artzi, L. Zettlemoyer, and R. Barzilay, “Learning to automatically solve algebra word problems,” inProceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 271–281, 2014

2014

[41] [41]

Deep neural solver for math word problems,

Y . Wang, X. Liu, and S. Shi, “Deep neural solver for math word problems,” inEMNLP, pp. 845–854, 2017

2017

[42] [42]

A goal-driven tree-structured neural model for math word problems,

Z. Xie and S. Sun, “A goal-driven tree-structured neural model for math word problems,” inIJCAI, pp. 5299–5305, 2019

2019

[43] [43]

Graph-to-tree learning for solving math word problems,

J. Zhang, L. Wang, R. K.-W. Lee, Y . Bin, Y . Wang, J. Shao, and E.- P. Lim, “Graph-to-tree learning for solving math word problems,” in ACL, 2020

2020

[44] [44]

Generate & rank: A multi-task framework for math word problems,

J. Shen, Y . Yin, L. Li, L. Shang, X. Jiang, M. Zhang, and Q. Liu, “Generate & rank: A multi-task framework for math word problems,” arXiv preprint arXiv:2109.03034, 2021

arXiv 2021

[45] [45]

MWP-BERT: A strong baseline for math word problems,

Z. Liang, J. Zhang, J. Shao, and X. Zhang, “MWP-BERT: A strong baseline for math word problems,”arXiv preprint arXiv:2107.13435, 2021

arXiv 2021

[46] [46]

Learning to reason deductively: Math word problem solving as complex relation extraction,

Z. Jie, J. Li, and W. Lu, “Learning to reason deductively: Math word problem solving as complex relation extraction,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5944–5955, 2022

2022

[47] [47]

Solving quantitative reasoning problems with language models,

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo,et al., “Solving quantitative reasoning problems with language models,” NeurIPS, vol. 35, pp. 3843–3857, 2022

2022

[48] [48]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu,et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[49] [49]

Qwen2.5-math technical report: Toward mathematical expert model via self-improvement,

A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin,et al., “Qwen2.5-math technical report: Toward mathematical expert model via self-improvement,”arXiv preprint arXiv:2409.12122, 2024

Pith/arXiv arXiv 2024

[50] [50]

STaR: Bootstrapping reasoning with reasoning,

E. Zelikman, Y . Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping reasoning with reasoning,”NeurIPS, vol. 35, pp. 15476–15488, 2022

2022

[51] [51]

rStar-Math: Small LLMs can mas- ter math reasoning with self-evolved deep thinking,

X. Guan, L. Li, Y . Chen, Z. Shao, Z. Ren, B. Liu, D. Dai, J. Shao, J. Song, Y . Wu, and Q. Zhu, “rStar-Math: Small LLMs can mas- ter math reasoning with self-evolved deep thinking,”arXiv preprint arXiv:2404.07846, 2025

arXiv 2025

[52] [52]

LIMO: Less is more for reasoning,

Y . Ye, Z. Huang, Y . Xiao, E. Chern, S. Xia, and P. Liu, “LIMO: Less is more for reasoning,”arXiv preprint arXiv:2502.03387, 2025

Pith/arXiv arXiv 2025

[53] [53]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,”NeurIPS, vol. 35, pp. 24824–24837, 2022

2022

[54] [54]

Self-consistency improves chain of thought rea- soning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd- hery, and D. Zhou, “Self-consistency improves chain of thought rea- soning in language models,” inICLR, 2023

2023

[55] [55]

Achieving>97% on GSM8K: Deeply understanding the problems makes LLMs better solvers for math word problems,

Q. Zhong, K. Wang, Z. Xu, J. Liu, L. Ding, B. Du, and D. Tao, “Achieving>97% on GSM8K: Deeply understanding the problems makes LLMs better solvers for math word problems,”arXiv preprint arXiv:2404.14963, 2024

arXiv 2024

[56] [56]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y . Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,”NeurIPS, vol. 36, 2023

2023

[57] [57]

Least-to-most prompting enables complex reasoning in large language models,

D. Zhou, N. Sch ¨arli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schu- urmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Least-to-most prompting enables complex reasoning in large language models,” in ICLR, 2023

2023

[58] [58]

Graph of thoughts: Solving elaborate problems with large language models,

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al., “Graph of thoughts: Solving elaborate problems with large language models,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, pp. 17682–17690, 2024

2024

[59] [59]

DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning,

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi,et al., “DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[60] [60]

Kimi k1.5: Scaling reinforcement learning with LLMs,

Kimi Team, “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv:2501.12599, 2025

Pith/arXiv arXiv 2025

[61] [61]

s1: Simple test-time scaling,

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Cand `es, and T. Hashimoto, “s1: Simple test-time scaling,”arXiv preprint arXiv:2501.19393, 2025

Pith/arXiv arXiv 2025

[62] [62]

Improv- ing factuality and reasoning in language models through multiagent debate,

Y . Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improv- ing factuality and reasoning in language models through multiagent debate,”arXiv preprint arXiv:2305.14325, 2023

Pith/arXiv arXiv 2023

[63] [63]

ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs,

J. C.-Y . Chen, S. Saha, and M. Bansal, “ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs,” inACL, 2024

2024

[64] [64]

MAgICoRe: Multi-agent, iterative, coarse-to-fine refinement for rea- soning,

J. C.-Y . Chen, A. Prasad, S. Saha, E. Stengel-Eskin, and M. Bansal, “MAgICoRe: Multi-agent, iterative, coarse-to-fine refinement for rea- soning,” inEMNLP, 2025

2025

[65] [65]

Graph-of-agents: A graph-based framework for multi-agent LLM collaboration,

S. Yun, J. Peng, P. Li, W. Fan, J. Chen, J. Zou, G. Li, and T. Chen, “Graph-of-agents: A graph-based framework for multi-agent LLM collaboration,” inICLR, 2026

2026

[66] [66]

AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning,

W. Zhaoet al., “AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning,”arXiv preprint arXiv:2505.23381, 2025

arXiv 2025

[67] [67]

FGeo- HyperGNet: Geometric problem solving integrating FormalGeo sym- bolic system and hypergraph neural network,

X. Zhang, Y . Li, N. Zhu, C. Qin, Z. Zeng, and T. Leng, “FGeo- HyperGNet: Geometric problem solving integrating FormalGeo sym- bolic system and hypergraph neural network,” inProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pp. 4733–4741, 2025

2025

[68] [68]

Solving olympiad geometry without human demonstrations,

T. H. Trinh, Y . Wu, Q. V . Le, H. He, and T. Luong, “Solving olympiad geometry without human demonstrations,”Nature, vol. 625, pp. 476– 482, 2024

2024

[69] [69]

Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,

Y . Chervonyi, T. H. Trinh, M. Ol ˇs´ak, X. Yang, H. Nguyen, M. Mene- gali, J. Jung, V . Verma, Q. V . Le, and T. Luong, “Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,” arXiv preprint arXiv:2502.03544, 2025

arXiv 2025

[70] [70]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “MathVista: Evaluating mathematical reasoning of foundation models in visual contexts,” inICLR, 2024

2024

[71] [71]

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?,

R. Zhang, D. Jiang, Y . Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y . Qiao, and H. Li, “MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?,” inECCV, 2024

2024

[72] [72]

Measuring multimodal mathematical reasoning with MATH-vision dataset,

K. Wang, J. Pan, W. Shi, Z. Lu, M. Zhan, and H. Li, “Measuring multimodal mathematical reasoning with MATH-vision dataset,” in NeurIPS Datasets and Benchmarks Track, 2024

2024

[73] [73]

MA VIS: Mathematical visual instruction tuning,

R. Zhang, X. Wei, D. Jiang, Y . Zhang, Z. Guo, C. Tong, J. Liu, A. Zhou, B. Wei, S. Zhang,et al., “MA VIS: Mathematical visual instruction tuning,”arXiv preprint arXiv:2407.08739, 2024

arXiv 2024

[74] [74]

MINT-CoT: Enabling interleaved visual tokens in mathematical chain- of-thought reasoning,

X. Chen, R. Zhang, D. Jiang, A. Zhou, S. Yan, W. Lin, and H. Li, “MINT-CoT: Enabling interleaved visual tokens in mathematical chain- of-thought reasoning,”arXiv preprint arXiv:2506.05331, 2025

arXiv 2025

[75] [75]

Generative language modeling for automated theorem proving,

S. Polu and I. Sutskever, “Generative language modeling for automated theorem proving,”arXiv preprint arXiv:2009.03393, 2020

Pith/arXiv arXiv 2009

[76] [76]

Lean Copilot: Large language models as copilots for theorem proving in Lean,

P. Song, K. Yang, and A. Anandkumar, “Lean Copilot: Large language models as copilots for theorem proving in Lean,”arXiv preprint arXiv:2404.12534, 2024

arXiv 2024

[77] [77]

APOLLO: Automated LLM and Lean collaboration for proof repair via compiler feedback,

Z. Sunet al., “APOLLO: Automated LLM and Lean collaboration for proof repair via compiler feedback,”arXiv preprint arXiv:2505.05758, 2025

arXiv 2025

[78] [78]

Baldur: Whole-proof generation and repair with large language models,

E. First, M. N. Rabe, T. Ringer, and Y . Brun, “Baldur: Whole-proof generation and repair with large language models,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, (New York, NY , USA), pp. 1229–1241, Association for Computing Machinery, 2023

2023

[79] [79]

DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition,

Z. Ren, Z. Shao, J. Song, H. Xin, H. Wang, W. Zhao, L. Zhang, Z. Fu, Q. Zhu, D. Yang,et al., “DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition,”arXiv preprint arXiv:2504.21801, 2025

Pith/arXiv arXiv 2025

[80] [80]

Kimina-Prover Preview: Towards large for- mal reasoning models with reinforcement learning,

H. Wanget al., “Kimina-Prover Preview: Towards large for- mal reasoning models with reinforcement learning,”arXiv preprint arXiv:2504.11354, 2025

Pith/arXiv arXiv 2025