pith. sign in

arxiv: 2510.12787 · v4 · pith:D2MS2PUPnew · submitted 2025-10-14 · 💻 cs.AI · cs.MA

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Pith reviewed 2026-05-25 07:44 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords theorem provingLeanmulti-agent systemsformal verificationabstract algebraquantum theorylarge language modelsautomated reasoning
0
0 comments X

The pith

Ax-Prover equips LLMs with Lean tools in a multi-agent setup to prove theorems across math and quantum physics where specialized systems fall short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Ax-Prover is a multi-agent framework that pairs large language models with Lean theorem-proving tools through the Model Context Protocol. The system generates formal proofs either autonomously or in collaboration with human experts. It was tested on two public math benchmarks plus two new Lean benchmarks in abstract algebra and quantum theory. Performance matches frontier models on established datasets but exceeds them on the new domain benchmarks. The authors present this as evidence that a tool-based agentic method supplies a generalizable route to formal verification across scientific fields, illustrated by its use in formalizing a complex cryptography theorem.

Core claim

The paper claims that a multi-agent system combining LLMs for reasoning with Lean tools via MCP delivers a generalizable methodology for automated theorem proving, remaining competitive with state-of-the-art provers on public math benchmarks while largely outperforming them on newly introduced benchmarks in abstract algebra and quantum theory.

What carries the argument

The multi-agent framework that supplies LLMs with Lean verification tools through the Model Context Protocol to enforce formal correctness during proof generation.

If this is right

  • The approach remains competitive with specialized provers on existing public math benchmarks.
  • It produces higher success rates than those systems on the new abstract algebra and quantum theory benchmarks.
  • It supports collaborative formalization of complex results such as a cryptography theorem with human experts.
  • The tool-based agentic design supplies a single methodology usable across multiple scientific domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same architecture could be tested on benchmarks from additional fields such as number theory or condensed-matter physics without domain-specific retraining.
  • Success here raises the question of whether similar tool integration would improve performance in non-Lean formal systems or in proof search without language models.
  • If the pattern holds, research groups might shift from maintaining separate provers per subfield toward shared agentic tool layers.

Load-bearing premise

That outperformance on the two new benchmarks demonstrates genuine cross-domain generalization rather than effects from how those benchmarks were constructed or tuned to the system.

What would settle it

An evaluation on a third independently designed benchmark in a different domain, with no changes to Ax-Prover, that shows whether the performance edge disappears.

Figures

Figures reproduced from arXiv: 2510.12787 by Benjamin Breen, Dirk Englund, Frank H. L. Koppens, Jacob McCarran, Jacob M. Taylor, Javier Aspuru Mijares, Kfir Sulimany, Marco Del Tredici, Weichen Winston Yin.

Figure 1
Figure 1. Figure 1: Left: the multi-agent workflow of Ax-Prover. Right: the tool-enhanced reasoning of the Prover. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The main steps performed by the Prover to prove a theorem. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Ax-Prover, a multi-agent system that integrates LLMs with Lean theorem-proving tools via the Model Context Protocol (MCP) to generate formal proofs. It evaluates the system as an autonomous prover on two existing public math benchmarks (where it is reported competitive with frontier LLMs and specialized provers) and on two newly introduced Lean benchmarks in abstract algebra and quantum theory (where it is reported to largely outperform specialized models). The central claim is that this tool-based agentic architecture provides a generalizable methodology for formal verification across scientific domains, unlike specialized systems. The paper also includes a collaborative demonstration in which Ax-Prover assists an expert in formalizing a complex cryptography theorem.

Significance. If the reported performance deltas on the new benchmarks are shown to be robust and free of construction bias, the work would offer a concrete demonstration that agentic, tool-augmented LLMs can extend formal theorem proving into domains such as abstract algebra and quantum theory where specialized provers have limited coverage. The explicit use of MCP for Lean tool access and the human-in-the-loop cryptography example are concrete strengths that could be built upon by the formal-methods and AI-for-science communities.

major comments (3)
  1. [Abstract] Abstract: the central claim that Ax-Prover 'largely outperforms' specialized models on the two new Lean benchmarks in abstract algebra and quantum theory is stated without any numerical results, baseline scores, error bars, problem counts, or description of how the benchmarks were constructed or filtered. This information is load-bearing for the generalization argument and its absence prevents evaluation of whether the reported gains reflect architectural advantages or benchmark-specific properties.
  2. [Benchmark sections (inferred from abstract)] The description of the new benchmarks (abstract algebra and quantum theory) provides no details on problem selection criteria, difficulty calibration, coverage of Lean tactics, or steps taken to prevent data leakage from training corpora. Without these, the outperformance cannot be taken as evidence that the multi-agent tool-use design generalizes rather than exploiting construction choices that reward iterative agent behavior.
  3. [Evaluation and conclusion sections (inferred from abstract)] The generalization claim ('unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology') rests entirely on the empirical comparison to the two new benchmarks; no ablation isolating the contribution of the multi-agent MCP architecture versus single-pass LLM prompting is reported, leaving the causal link between architecture and cross-domain performance untested.
minor comments (2)
  1. [Abstract] The abstract mentions 'two public math benchmarks' without naming them; explicit citation of the datasets (e.g., miniF2F, ProofNet) would improve reproducibility.
  2. [Use-case section (inferred from abstract)] The cryptography use-case is presented as a practical demonstration but lacks any quantitative measure of human effort saved or proof length/complexity, which would strengthen the assistant-capability claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to strengthen the presentation of the benchmarks and evaluation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Ax-Prover 'largely outperforms' specialized models on the two new Lean benchmarks in abstract algebra and quantum theory is stated without any numerical results, baseline scores, error bars, problem counts, or description of how the benchmarks were constructed or filtered. This information is load-bearing for the generalization argument and its absence prevents evaluation of whether the reported gains reflect architectural advantages or benchmark-specific properties.

    Authors: We agree that the abstract would benefit from greater specificity to support the central claim. In the revised manuscript we have updated the abstract to report problem counts for the new benchmarks, key performance numbers with baselines, and a concise description of construction and filtering. Full tables with error bars appear in the evaluation section. revision: yes

  2. Referee: [Benchmark sections (inferred from abstract)] The description of the new benchmarks (abstract algebra and quantum theory) provides no details on problem selection criteria, difficulty calibration, coverage of Lean tactics, or steps taken to prevent data leakage from training corpora. Without these, the outperformance cannot be taken as evidence that the multi-agent tool-use design generalizes rather than exploiting construction choices that reward iterative agent behavior.

    Authors: We acknowledge the need for these details. The revised manuscript adds a new subsection that specifies problem selection criteria (recent publications and textbooks), difficulty calibration via expert review and tactic complexity, Lean tactic coverage, and leakage-prevention steps including post-training-cutoff sourcing and manual originality checks. revision: yes

  3. Referee: [Evaluation and conclusion sections (inferred from abstract)] The generalization claim ('unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology') rests entirely on the empirical comparison to the two new benchmarks; no ablation isolating the contribution of the multi-agent MCP architecture versus single-pass LLM prompting is reported, leaving the causal link between architecture and cross-domain performance untested.

    Authors: We agree that a direct ablation would strengthen the causal argument. The original experiments did not include a single-pass versus multi-agent ablation. In the revision we have added an explicit discussion of this limitation in the evaluation section, clarified that the specialized baselines are architecturally distinct from LLM-based agents, and noted that future work will include such an ablation. The current evidence still supports generalization via the observed cross-domain performance gap. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation only

full rationale

The paper presents an empirical multi-agent system evaluated on public benchmarks plus two author-introduced Lean benchmarks in abstract algebra and quantum theory. No mathematical derivation chain, fitted parameters, or first-principles results are described. The generalization claim rests on reported outperformance comparisons rather than any self-referential construction, self-citation load-bearing premise, or ansatz. This matches the default expectation for non-circular empirical work; the central claim has independent content in the form of external benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied AI-systems paper describing an engineering framework; it introduces no free parameters, mathematical axioms, or postulated physical entities.

pith-pipeline@v0.9.0 · 5785 in / 1144 out tokens · 23660 ms · 2026-05-25T07:44:14.644341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  2. NeuroClaw Technical Report

    cs.CV 2026-04 unverdicted novelty 6.0

    NeuroClaw introduces a three-tier multi-agent framework and NeuroBench benchmark that improve executability and reproducibility scores for neuroimaging tasks when used with multimodal LLMs.

  3. The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

    cs.AI 2026-04 unverdicted novelty 6.0

    The topological dual of a dataset is introduced as a transformation that encodes logical structures into topological ones to expose invariants in neural latent spaces for AlphaGeometry-style reasoning.

  4. Automated Conjecture Resolution with Formal Verification

    cs.LG 2026-04 unverdicted novelty 6.0

    An AI framework combining informal reasoning and formal verification resolves an open commutative algebra problem and produces a Lean 4-checked proof with minimal human input.

  5. LLMs with in-context learning for Algorithmic Theoretical Physics

    cs.LG 2026-05 unverdicted novelty 5.0

    Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 5 Pith papers · 9 internal anchors

  1. [1]

    arXiv, 2025

    Mathematics (arxiv archive). arXiv, 2025. Accessed: 2025-09-24

  2. [2]

    Aristotle: IMO-level Automated Theorem Proving

    Tudor Achim, Alex Best, Kevin Der, Math¨ıs F´ed´erico, Sergei Gukov, Daniel Halpern-Leister, Kirsten Hennings- gard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, et al. Aristotle: Imo-level automated theorem proving.arXiv preprint arXiv:2510.01346, 2025

  3. [3]

    Claude 3.7 sonnet and claude code.https://www.anthropic.com/news/ claude-3-7-sonnet, 2025

    Anthropic. Claude 3.7 sonnet and claude code.https://www.anthropic.com/news/ claude-3-7-sonnet, 2025. Accessed: YYYY-MM-DD

  4. [4]

    Claude 4.https://www.anthropic.com/news/claude-4, 2025

    Anthropic. Claude 4.https://www.anthropic.com/news/claude-4, 2025. Accessed: 2025-09-16

  5. [5]

    Claude documentation, 2025

    Anthropic. Claude documentation, 2025. Accessed: 2025-09-19

  6. [6]

    Claude opus 4.1.https://www.anthropic.com/claude/opus, 2025

    Anthropic. Claude opus 4.1.https://www.anthropic.com/claude/opus, 2025. Accessed: 2025-10- 14

  7. [7]

    Introducing claude haiku 4.5, October 2025

    Anthropic PBC. Introducing claude haiku 4.5, October 2025. Accessed on November 14, 2025

  8. [8]

    Leanlm: Large language models for lean theorem proving.arXiv preprint arXiv:2306.09264, 2023

    Ethan Ayers et al. Leanlm: Large language models for lean theorem proving.arXiv preprint arXiv:2306.09264, 2023

  9. [9]

    Formal proving with llms: Lean as a benchmark

    Zhenisbek Azerbayev et al. Formal proving with llms: Lean as a benchmark. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 23

  10. [10]

    Prover agent: An agent-based framework for formal mathematical proofs.arXiv preprint arXiv:2506.19923, 2025

    Kaito Baba, Chaoran Liu, Shuhei Kurita, and Akiyoshi Sannai. Prover agent: An agent-based framework for formal mathematical proofs.arXiv preprint arXiv:2506.19923, 2025

  11. [11]

    Satisfiability modulo theories: An appe- tizer.Communications of the ACM, 65(6):69–77, 2022

    Haniel Barbosa, Clark Barrett, Pascal Fontaine, and Andrew Reynolds. Satisfiability modulo theories: An appe- tizer.Communications of the ACM, 65(6):69–77, 2022

  12. [12]

    Computer-aided security proofs for the working cryptographer

    Gilles Barthe, Benjamin Gr ´egoire, Sylvain Heraud, and Santiago Zanella B ´eguelin. Computer-aided security proofs for the working cryptographer. InAnnual Cryptology Conference, pages 71–90. Springer, 2011

  13. [13]

    Tactician: Lean proof automation with knn

    Bart Blaauwbroek et al. Tactician: Lean proof automation with knn. InProceedings of the International Confer- ence on Interactive Theorem Proving (ITP), pages 348–366. Springer, 2020

  14. [14]

    A computationally sound mechanized prover for security protocols.IEEE Transactions on Dependable and Secure Computing, 5(4):193–207, 2008

    Bruno Blanchet. A computationally sound mechanized prover for security protocols.IEEE Transactions on Dependable and Secure Computing, 5(4):193–207, 2008

  15. [15]

    Knudsen, Gregor Leander, Ventzislav Nikov, Christof Paar, Christian Rechberger, Peter Rombouts, Søren S

    Julia Borghoff, Anne Canteaut, Tim G ¨uneysu, Elif Bilge Kavun, Miroslav Kne ˇzevi´c, Lars R. Knudsen, Gregor Leander, Ventzislav Nikov, Christof Paar, Christian Rechberger, Peter Rombouts, Søren S. Thomsen, and Tolga Yalc ¸ın. Prince – a low-latency block cipher for pervasive computing applications. In Xiaoyun Wang and Kazue Sako, editors,Advances in Cry...

  16. [16]

    Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025

    Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025

  17. [17]

    Evaluating Large Language Models Trained on Code

    Mark Chen et al. Evaluating large language models trained on code. InarXiv preprint arXiv:2107.03374, 2021

  18. [18]

    Quantum algorithms for lattice problems

    Yilei Chen. Quantum algorithms for lattice problems. Technical report, Cryptology ePrint Archive, Report 2024/555, 2024. Updated April 18: algorithm contains an unfixable bug invalidating the main claim (see Section 3.5.9, Page 37)

  19. [19]

    Gold-medalist performance in solving olympiad geometry with alphageometry2.arXiv preprint arXiv:2502.03544, 2025

    Yuri Chervonyi, Trieu H Trinh, Miroslav Ol ˇs´ak, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V Le, and Thang Luong. Gold-medalist performance in solving olympiad geometry with alphageometry2.arXiv preprint arXiv:2502.03544, 2025

  20. [20]

    Llms as conversational partners for mathematicians.arXiv preprint arXiv:2305.XXXX, 2023

    Emily Collins et al. Llms as conversational partners for mathematicians.arXiv preprint arXiv:2305.XXXX, 2023

  21. [21]

    Z3: An efficient smt solver

    Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient smt solver. InProceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 337–340. Springer, 2008

  22. [22]

    The lean theorem prover (system description)

    Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InAutomated Deduction – CADE-25, volume 9195 ofLecture Notes in Computer Science, pages 378–388. Springer, 2015

  23. [23]

    Deepseek-prover-v1 dataset.https://huggingface.co/datasets/deepseek-ai/ DeepSeek-Prover-V1, 2024

    DeepSeek-AI. Deepseek-prover-v1 dataset.https://huggingface.co/datasets/deepseek-ai/ DeepSeek-Prover-V1, 2024. Accessed: 2025-08-24

  24. [24]

    Deepseek-prover-v2-671b, 2025

    deepseek ai. Deepseek-prover-v2-671b, 2025. Accessed: 2025-09-24

  25. [25]

    Lean-lsp-mcp: Tools for agentic interaction with the lean theorem prover, 3 2025

    Oliver Dressler. Lean-lsp-mcp: Tools for agentic interaction with the lean theorem prover, 3 2025. Accessed: 2025-08-24

  26. [26]

    Abstract algebra

    David S Dummit and Richard M Foote. Abstract algebra. john wile & sons.Inc., Hoboken, NJ, 2004

  27. [27]

    Praveen Gauravaram, Lars Knudsen, Krystian Matusiewicz, Florian Mendel, Christian Rechberger, Martin Schl¨affer, and Søren S. Thomsen. Grøstl – a sha-3 candidate. Submission to nist, NIST, September 2008. Available athttp://www.groestl.info/

  28. [28]

    Tactictoe: Learning to prove with tactics

    Th ´er`ese Gauthier, Cezary Kaliszyk, and Josef Urban. Tactictoe: Learning to prove with tactics. InProceedings of the International Conference on Automated Deduction (CADE), pages 275–294. Springer, 2021. 24

  29. [29]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025

  30. [30]

    Learning to prove theorems via interacting with proof assistants

    Daniel Huang et al. Learning to prove theorems via interacting with proof assistants. InInternational Conference on Machine Learning (ICML), pages 2654–2663, 2019

  31. [31]

    Alemi, et al

    Geoffrey Irving, Christian Szegedy, Alexander A. Alemi, et al. Deepmath - deep sequence models for premise selection. InAdvances in Neural Information Processing Systems (NeurIPS), pages 2235–2243, 2016

  32. [32]

    Another look at” provable security”.Journal of Cryptology, 20(1):3–37, 2007

    Neal Koblitz and Alfred J Menezes. Another look at” provable security”.Journal of Cryptology, 20(1):3–37, 2007

  33. [33]

    First-order theorem proving and vampire

    Laura Kov ´acs and Andrei V oronkov. First-order theorem proving and vampire. InProceedings of the Interna- tional Conference on Computer Aided Verification (CAV), pages 1–35. Springer, 2013

  34. [34]

    Leanagent: Lifelong learning for formal theorem proving.arXiv preprint arXiv:2410.06209, 2024

    Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, and Anima Anandkumar. Leanagent: Lifelong learning for formal theorem proving.arXiv preprint arXiv:2410.06209, 2024

  35. [35]

    Deep reinforcement learning for theorem proving

    Guillaume Lample and Franc ¸ois Charton. Deep reinforcement learning for theorem proving. InInternational Conference on Learning Representations (ICLR), 2022

  36. [36]

    Mathlib statistics.https://leanprover-community.github.io/ mathlib_stats.html, 2025

    Lean Prover Community. Mathlib statistics.https://leanprover-community.github.io/ mathlib_stats.html, 2025. GitHub repository for generating statistics plots for Mathlib; accessed 2025-08-24

  37. [37]

    Numinamath.[https://huggingface.co/AI-MO/NuminaMath-1.5](https: //github.com/project-numina/aimo-progress-prize/blob/main/report/numina_ dataset.pdf), 2024

    Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath.[https://huggingface.co/AI-MO/NuminaMath-1.5](https: //github.com/project-numina/aimo-progress-prize/blob/main/report/num...

  38. [38]

    Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, et al. Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

  39. [39]

    Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

  40. [40]

    Unconditional Security Of Quantum Key Distribution Over Arbitrarily Long Distances

    Hoi-Kwong Lo and HF Chau. Unconditional security of quantum key distribution over arbitrarily long distances. arXiv preprint quant-ph/9803006, 1998

  41. [41]

    Unconditional security of quantum key distribution over arbitrarily long distances.science, 283(5410):2050–2056, 1999

    Hoi-Kwong Lo and Hoi Fung Chau. Unconditional security of quantum key distribution over arbitrarily long distances.science, 283(5410):2050–2056, 1999

  42. [42]

    Process-driven autoformalization in lean 4

    Zeyuan Lu, Guodong Zhang, Junxiao Chen, Huan Chen, Yilun Chen, Zonglin Li, Yiping Li, Lianmin Wang, Yao Lin, Ce Zhang, and Jie Chen. Process-driven autoformalization in lean 4. InThe Thirteenth International Conference on Learning Representations, 2025

  43. [43]

    Introducing gauss, an agent for autoformalization, 2025

    Math Inc. Introducing gauss, an agent for autoformalization, 2025. Announcement of autoformalization agent for formal verification in mathematics

  44. [44]

    A formalization of the generalized quantum stein’s lemma in lean.arXiv preprint arXiv:2510.08672, 2025

    Alex Meiburg, Leonardo A Lessa, and Rodolfo R Soldati. A formalization of the generalized quantum stein’s lemma in lean.arXiv preprint arXiv:2510.08672, 2025

  45. [45]

    P. R. Mishra, Yogesh Kumar, Susanta Samanta, and Atul Gaur. A new algorithm for computing branch number of non-singular matrices over finite fields.arXiv preprint arXiv:2405.07007, 2024. 25

  46. [46]

    What is the model context protocol (mcp)?https://modelcontextprotocol

    Model Context Protocol. What is the model context protocol (mcp)?https://modelcontextprotocol. io/docs/getting-started/intro, 2024. Accessed: 2025-10-05

  47. [47]

    Robust de-anonymization of large sparse datasets

    Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125. IEEE, 2008

  48. [48]

    Advanced encryption standard (aes)

    National Institute of Standards and Technology. Advanced encryption standard (aes). Federal Information Processing Standards Publication FIPS 197-upd1, U.S. Department of Commerce, Gaithersburg, MD, 2001. Published November 26, 2001; Updated May 9, 2023

  49. [49]

    Cambridge university press, 2010

    Michael A Nielsen and Isaac L Chuang.Quantum computation and quantum information. Cambridge university press, 2010

  50. [50]

    Numinamath-lean dataset.https://huggingface.co/datasets/AI-MO/ NuminaMath-LEAN, 2025

    Numina-Team. Numinamath-lean dataset.https://huggingface.co/datasets/AI-MO/ NuminaMath-LEAN, 2025. Accessed: 2025-08-24

  51. [51]

    Openai models documentation, 2025

    OpenAI. Openai models documentation, 2025. Accessed: 2025-09-19

  52. [52]

    Apollo: Automated llm and lean collaboration for advanced formal reasoning.arXiv preprint arXiv:2505.05758, 2025

    Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh. Apollo: Automated llm and lean collaboration for advanced formal reasoning.arXiv preprint arXiv:2505.05758, 2025

  53. [53]

    Formal mathematics statement curriculum learning

    Stanislas Polu et al. Formal mathematics statement curriculum learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  54. [54]

    Generative language modeling for automated theorem proving

    Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. InInterna- tional Conference on Learning Representations (ICLR), 2020

  55. [55]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    Z.Z. Ren, Zhihong Shao, Wenfeng Liang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.https://arxiv.org/abs/2504.21801, 2025. Accessed: 2025-08-24

  56. [56]

    S ´ebastien Rousseau. Bug discovered in quantum algorithm for lattice-based crypto.https:// sebastienrousseau.com/2024-04-22-bug-discovered-in-quantum-algorithm-for-lattice-based-crypto/ index.html, April 22 2024. Accessed: [add access date here]

  57. [57]

    E prover 2.0: Integrating equational and first-order logic

    Stephan Schulz, Simon Cruanes, and Petar Vukmirovi´c. E prover 2.0: Integrating equational and first-order logic. InProceedings of the International Conference on Automated Deduction (CADE), pages 523–541. Springer, 2019

  58. [58]

    Simple proof of security of the bb84 quantum key distribution protocol.Physical review letters, 85(2):441, 2000

    Peter W Shor and John Preskill. Simple proof of security of the bb84 quantum key distribution protocol.Physical review letters, 85(2):441, 2000

  59. [59]

    Sequences of games: a tool for taming complexity in security proofs.cryptology eprint archive, 2004

    Victor Shoup. Sequences of games: a tool for taming complexity in security proofs.cryptology eprint archive, 2004

  60. [60]

    Lean copilot: Large language models as copilots for theorem proving in lean.arXiv preprint arXiv:2404.12534, 2024

    Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Lean copilot: Large language models as copilots for theorem proving in lean.arXiv preprint arXiv:2404.12534, 2024

  61. [61]

    k-anonymity: A model for protecting privacy.International journal of uncertainty, fuzziness and knowledge-based systems, 10(05):557–570, 2002

    Latanya Sweeney. k-anonymity: A model for protecting privacy.International journal of uncertainty, fuzziness and knowledge-based systems, 10(05):557–570, 2002

  62. [62]

    Putnambench leaderboard.https://trishullab.github.io/PutnamBench/ leaderboard.html, 2025

    Trishullab. Putnambench leaderboard.https://trishullab.github.io/PutnamBench/ leaderboard.html, 2025. Accessed: 2025-10-07

  63. [63]

    Putnambench: Evaluating neural theorem-provers on the putnam mathematical competi- tion.Advances in Neural Information Processing Systems, 37:11545–11569, 2024

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competi- tion.Advances in Neural Information Processing Systems, 37:11545–11569, 2024

  64. [64]

    Machine learning preselected proof steps

    Josef Urban, Geoff Sutcliffe, Stefan Petrov, and Josef Vysko ˇcil. Machine learning preselected proof steps. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2046–2051, 2011. 26

  65. [65]

    Hilbert: Recursively building formal proofs with informal reasoning.arXiv preprint arXiv:2509.22819, 2025

    Sumanth Varambally, Thomas V oice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. Hilbert: Recursively building formal proofs with informal reasoning.arXiv preprint arXiv:2509.22819, 2025

  66. [66]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Haiming Wang et al. Kimina-prover preview: Towards large formal reasoning models with reinforcement learn- ing.https://arxiv.org/abs/2504.11354, 2025. Accessed: 2025-08-24

  67. [67]

    Autoformalization in the era of large language models: A survey.arXiv preprint arXiv:2505.23486, 2025

    Qihao Wu, Haotian Zhang, Jialin Chen, Yizhou Li, Xingjian Zhang, Ce Zhang, and Jie Chen. Autoformalization in the era of large language models: A survey.arXiv preprint arXiv:2505.23486, 2025

  68. [68]

    Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.https: //arxiv.org/abs/2405.14333, 2024

    Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.https: //arxiv.org/abs/2405.14333, 2024. Accessed: 2025-08-24

  69. [69]

    Deepseek-prover-v1

    Huajian Xin, ZZ Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, et al. Deepseek-prover-v1. 5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search.arXiv preprint arXiv:2408.08152, 2024

  70. [70]

    Leandojo: Theorem proving with large language models

    Zhangir Xin et al. Leandojo: Theorem proving with large language models. InInternational Conference on Learning Representations (ICLR), 2024

  71. [71]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

  72. [72]

    Lean workbook: A large-scale lean problem set formalized from natural language math problems.https://arxiv.org/abs/2406

    Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, and Kai Chen. Lean workbook: A large-scale lean problem set formalized from natural language math problems.https://arxiv.org/abs/2406. 03847, 2024. Accessed: 2025-08-24

  73. [73]

    MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. Minif2f: A cross-system benchmark for formal olympiad- level mathematics.https://arxiv.org/abs/2109.00110, 2021. Accessed: 2025-08-24. 27