Advancing Mathematics Research with AI-Driven Formal Proof Search
Pith reviewed 2026-05-22 05:05 UTC · model grok-4.3
The pith
An AI agent using large language models generates and verifies formal proofs in Lean to solve open Erdős problems and OEIS conjectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an AI agent can autonomously resolve open problems by producing Lean-verifiable formal proofs, as shown by the resolution of 9 Erdős problems and 44 OEIS conjectures, with the technique now applied across multiple branches of mathematics.
What carries the argument
An agent that alternates large-language-model generation of candidate Lean proofs with automated verification inside the Lean kernel.
If this is right
- Formal verification can be inserted into the workflow of active mathematical research at modest per-problem cost.
- The same agent design works across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics.
- A basic alternating generation-and-verification loop already reproduces the Erdős successes, though harder problems raise its cost.
- Large-scale testing on hundreds of open statements is now feasible with current language-model capabilities.
Where Pith is reading between the lines
- If the cost stays low, smaller research groups could run their own formal-proof searches on local conjectures without large budgets.
- The approach could be tried in other formal systems besides Lean to broaden the set of checkable problems.
- Hybrid loops that let human mathematicians edit the agent's intermediate steps might solve still harder open questions.
- Success on Erdős and OEIS lists suggests the method could be aimed next at problems that have resisted both human and computer attack for decades.
Load-bearing premise
The targeted Erdős problems and OEIS conjectures were still open before this work, and the Lean outputs are complete formal proofs produced without essential human intervention in the derivation steps.
What would settle it
An independent Lean expert confirming that one of the reported proofs is incomplete or that any of the nine Erdős problems had already been solved before the agent ran would disprove the claim of autonomous resolution.
Figures
read the original abstract
Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erd\H{o}s problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erd\H{o}s successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the first large-scale evaluation of LLM-based agents for generating formal proofs in Lean to solve open mathematical problems. The central claims are that the most capable agent autonomously resolved 9 of 353 open Erdős problems at a per-problem cost of a few hundred dollars, proved 44 of 492 OEIS conjectures, and is now deployed across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics; a simpler alternating generation-verification agent is shown to replicate some successes but at higher cost on hard instances.
Significance. If the reported successes are independently confirmed, the work would demonstrate that current LLM-driven formal proof search can address genuinely open problems at modest cost, providing a concrete benchmark for AI-assisted mathematics. The use of Lean verification supplies a high standard of correctness that distinguishes this from informal LLM reasoning claims, and the multi-domain deployment suggests immediate research utility beyond the Erdős and OEIS benchmarks.
major comments (2)
- [Abstract and results section] Abstract and results section: the headline claim that the agent 'autonomously resolved 9 of 353 open Erdős problems' is load-bearing for the autonomy and cost assertions, yet the manuscript supplies neither per-problem literature-search logs confirming the problems were previously unsolved nor proof-size statistics or artifact links demonstrating that the Lean outputs are complete, hole-free formal proofs rather than partial derivations requiring human post-editing. Without these, the success counts cannot be fully assessed.
- [Methods and evaluation sections] Methods and evaluation sections: the reported success counts (9/353 Erdős, 44/492 OEIS) are presented without error bars, failure-mode breakdowns, or independent audit of the Lean artifacts. This omission directly affects the reliability of the comparative claim that the basic agent 'replicated the Erdős successes but proved costlier on the hardest problems.'
minor comments (2)
- [§2] The description of agent architectures would benefit from a clearer tabular comparison of the 'most capable' versus 'basic' designs, including exact prompting strategies and Lean interaction loops.
- [Figures and tables] Figure captions and table legends should explicitly state whether the reported costs include only inference or also human oversight time.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify the presentation of our results on LLM-based formal proof search. We address each major comment below and describe the revisions we will incorporate to improve transparency and verifiability.
read point-by-point responses
-
Referee: [Abstract and results section] Abstract and results section: the headline claim that the agent 'autonomously resolved 9 of 353 open Erdős problems' is load-bearing for the autonomy and cost assertions, yet the manuscript supplies neither per-problem literature-search logs confirming the problems were previously unsolved nor proof-size statistics or artifact links demonstrating that the Lean outputs are complete, hole-free formal proofs rather than partial derivations requiring human post-editing. Without these, the success counts cannot be fully assessed.
Authors: We agree that explicit documentation strengthens the autonomy and correctness claims. In the revised manuscript we will add a dedicated subsection (and supplementary table) that lists, for each of the nine solved Erdős problems, the pre-experiment literature references establishing that the problem remained open, together with links to the corresponding Lean proof files in a public repository. We will also report proof-size statistics (number of tactics, lines of code, and kernel-verification time) for every success. All reported proofs were produced and checked end-to-end by the agent with no human post-editing; the added artifacts will make this verifiable. revision: yes
-
Referee: [Methods and evaluation sections] Methods and evaluation sections: the reported success counts (9/353 Erdős, 44/492 OEIS) are presented without error bars, failure-mode breakdowns, or independent audit of the Lean artifacts. This omission directly affects the reliability of the comparative claim that the basic agent 'replicated the Erdős successes but proved costlier on the hardest problems.'
Authors: We concur that quantitative uncertainty measures and failure analysis improve interpretability. The revision will include binomial 95 % confidence intervals for both success rates and a breakdown of the dominant failure modes (timeout, tactic failure, search exhaustion) across the full set of attempts. We will also expand the agent-comparison section with per-problem cost figures for the hardest instances to support the cost-effectiveness claim. All Lean artifacts and run logs will be released publicly to enable independent audit; performing a third-party audit ourselves lies outside the scope of the present study. revision: partial
Circularity Check
No significant circularity; results are empirical evaluations on external benchmarks
full rationale
The paper reports counts of resolved open Erdős problems and OEIS conjectures obtained by running LLM-based agents and verifying outputs in Lean. These success metrics are generated by direct application to independently defined external problems rather than by fitting parameters to subsets of the target data or by self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that reduce the headline claims to the inputs by construction. The methodology relies on external verification and open problem lists, making the derivation chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The Lean theorem prover provides sound verification of formal proofs
Forward citations
Cited by 3 Pith papers
-
Code evolution for link prediction in complex networks
Code evolution produces link prediction algorithms with average AUC of 0.915 versus 0.783 for human-designed methods across 580 networks, with better scalability.
-
Recursions for Mock Theta Functions
Derives weighted recursions for coefficients of mock theta functions f and ω via holomorphic projection on vector-valued Rankin-Cohen brackets, exploiting vanishing cusp form spaces.
-
Skills for the future software profession: beyond agentic AI!
Round-table discussions with researchers and practitioners indicate verification and validation skills will become central for software engineers in the agentic AI era.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.