pith. sign in

arxiv: 2605.29955 · v1 · pith:Z5CG7KAYnew · submitted 2026-05-28 · 💻 cs.AI

Formalizing Mathematics at Scale

Pith reviewed 2026-06-29 07:32 UTC · model grok-4.3

classification 💻 cs.AI
keywords autoformalizationLean 4multi-agent systemsformal mathematicstheorem verificationtextbook formalizationAtlas libraryLLM agents
0
0 comments X

The pith

A multi-agent system translates 26 math textbooks into a verified Lean library of 45,000 declarations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoformBot, a system that coordinates thousands of LLM agents equipped with Lean verification tools to convert informal textbook prose into formal definitions and proofs. It applies this process to 26 open-access texts covering analysis, algebra, topology, combinatorics, and probability, yielding Atlas with over 45,000 Lean 4 declarations and 500,000 lines of code. The central demonstration is that the resulting library is machine-verified, which the authors take as evidence that autoformalization of core graduate mathematics can be performed at scale. This feasibility claim rests on the system's handling of dependencies, collaboration, and verification without requiring extensive manual correction.

Core claim

AutoformBot orchestrates LLM agents with formal verification, dependency-aware scheduling, and version control to translate textbook content into Lean 4. The process applied to 26 textbooks produces Atlas, a library containing more than 45,000 declarations that pass Lean's checker. The authors state that this scale of autoformalization for graduate-level mathematics is now economically and technically feasible and opens the door to automated verification of research-level mathematics whether generated by humans or machines.

What carries the argument

AutoformBot, a multi-agent orchestration framework that equips LLM agents with Lean verification tools, dependency scheduling, and collaborative version control to convert informal mathematics into verified formal code.

If this is right

  • Textbook content across standard graduate fields can be turned into machine-checked libraries without manual formalization of each step.
  • Verified libraries of this size become practical to build and maintain through automated agent coordination.
  • Automated verification extends to both human-written and machine-generated mathematics at the level of entire textbooks.
  • Dependency management and version control can be handled at the scale of hundreds of thousands of lines without central human oversight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same orchestration might be applied to research papers rather than only textbooks, though coverage of open problems would require additional mechanisms.
  • The released Atlas library could serve as training data to improve future formalization models beyond the current system.
  • Extending the approach to fields with heavier use of diagrams or non-textual arguments would test whether the text-to-formal pipeline generalizes.
  • If the cost per declaration continues to drop, formalization could become a standard preprocessing step for new mathematical results.

Load-bearing premise

The Lean verifier catches all errors in the generated formalizations, including any subtle mismatches between the original textbook intent and the translated statements.

What would settle it

An error in one of the Atlas declarations that passes Lean's type checker but contradicts the corresponding textbook statement.

read the original abstract

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoformBot, a multi-agent LLM orchestration system equipped with formal verification, dependency scheduling, and version control. It applies the system to 26 open-access textbooks across analysis, algebra, topology, combinatorics, and probability to produce Atlas, a Lean 4 library containing over 45,000 declarations and 500,000 lines of code. The authors release both AutoformBot and Atlas as open artifacts and conclude that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible.

Significance. If the generated formalizations accurately reflect the source textbooks, the work provides a concrete empirical demonstration of large-scale autoformalization together with fully released artifacts. The open release of both the multi-agent framework and the resulting library constitutes a clear strength that can support downstream research in verified mathematics.

major comments (2)
  1. [Abstract] Abstract: the description of Atlas as a 'verified library' produced from textbook prose is load-bearing for the feasibility claim, yet the manuscript reports no fidelity metrics, human audits, or alignment statistics between the informal source statements and the generated Lean declarations. Lean verification alone confirms internal consistency of the output code but does not establish semantic fidelity or rule out systematic translation gaps.
  2. [Results] Results section describing Atlas: no quantitative figures are supplied for coverage completeness (e.g., fraction of textbook theorems successfully formalized), post-verification error rates, or total human effort required, leaving the strength of evidence for the central scalability claim moderate.
minor comments (2)
  1. [Methods] The multi-agent workflow diagram would benefit from explicit annotation of the dependency-aware scheduler and version-control integration steps.
  2. [Results] A short table summarizing per-textbook declaration counts and verification success rates would improve readability of the scale claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on fidelity and quantitative evidence. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the description of Atlas as a 'verified library' produced from textbook prose is load-bearing for the feasibility claim, yet the manuscript reports no fidelity metrics, human audits, or alignment statistics between the informal source statements and the generated Lean declarations. Lean verification alone confirms internal consistency of the output code but does not establish semantic fidelity or rule out systematic translation gaps.

    Authors: We agree that the manuscript does not report explicit fidelity metrics or human audits, and that Lean verification alone establishes internal consistency rather than semantic alignment with the source textbooks. In the revised version we will add a dedicated evaluation subsection reporting human audit results on a representative sample of declarations (including alignment accuracy and identified translation issues) together with any available system-level success statistics. We will also revise the abstract to distinguish between Lean verification and source fidelity. revision: yes

  2. Referee: [Results] Results section describing Atlas: no quantitative figures are supplied for coverage completeness (e.g., fraction of textbook theorems successfully formalized), post-verification error rates, or total human effort required, leaving the strength of evidence for the central scalability claim moderate.

    Authors: We acknowledge the absence of these quantitative figures in the current results section. The manuscript emphasizes aggregate scale rather than per-textbook coverage or error rates. In revision we will add coverage estimates (where textbook theorem counts can be determined), post-verification error rates extracted from AutoformBot logs, and an account of human oversight effort. These additions will strengthen the evidence presented for scalability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems demonstration with released artifacts

full rationale

The paper describes construction and application of AutoformBot to generate Atlas from 26 textbooks, yielding 45k Lean declarations. The central claim of feasibility is an empirical outcome from running the system, not a derivation that reduces to fitted parameters, self-definitions, or self-citation chains. Lean verification is an external checker independent of the paper's inputs. No equations, uniqueness theorems, or ansatzes are invoked that collapse to the paper's own construction. The result is self-contained against external benchmarks via released code and library.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The feasibility claim rests on the correctness of Lean 4 verification and the assumption that the agent system can translate textbook content without introducing unverifiable gaps. No free parameters or invented physical entities are involved.

axioms (1)
  • standard math Lean 4's type checker and kernel correctly implement the underlying logic and detect all proof errors.
    The entire verification claim depends on Lean 4 being a sound proof assistant; invoked when stating that Atlas consists of 'verified' declarations.
invented entities (1)
  • AutoformBot multi-agent orchestration system no independent evidence
    purpose: Coordinate LLMs, verification tools, and dependency scheduling for textbook translation
    New system introduced by the authors to achieve the reported scale.

pith-pipeline@v0.9.1-grok · 5709 in / 1283 out tokens · 27098 ms · 2026-06-29T07:32:59.772072+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 27 canonical work pages · 10 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Blumberg, Martin Hairer, Joe Kileel, Tamara G

    Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First P roof, 2026. https://arxiv.org/abs/2602.05192

  3. [3]

    Aristotle: IMO-level Automated Theorem Proving

    Tudor Achim, Alex Best, Alberto Bietti, Mathis Federico, Sergei Gukov, Daniel Halpern-Leistner, et al. Aristotle: IMO -level automated theorem proving. arXiv preprint arXiv:2510.01346, 2025

  4. [4]

    NuminaMath-LEAN

    AI-MO . NuminaMath-LEAN . https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN, 2025. Hugging Face dataset. 100K competition mathematics problems formalized in Lean 4. Accessed: 2026-04-29

  5. [5]

    Model context protocol, 2024

    Anthropic . Model context protocol, 2024. https://modelcontextprotocol.io

  6. [6]

    Claude O pus 4.6 system card

    Anthropic . Claude O pus 4.6 system card. https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, 2026

  7. [7]

    Formalization of De Giorgi--Nash--Moser Theory in Lean

    Scott Armstrong and Julia Kempe. Formalization of D e G iorgi-- N ash-- M oser T heory in L ean, 2026. https://arxiv.org/abs/2604.05984

  8. [8]

    Takeaways from the F irst- P roof trenches

    Scott Armstrong, Julia Kempe, and Remi Munos. Takeaways from the F irst- P roof trenches. https://kempejulia1.github.io/1stProof-Attempt/1stProofTakeaways.pdf, 2026. Accessed: 2026-05-06

  9. [9]

    Ayers, Dragomir Radev, and Jeremy Avigad

    Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. ProofNet : Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433, 2023

  10. [10]

    The role of the Mizar mathematical library for interactive proof development in Mizar

    Grzegorz Bancerek, Czes aw Byli \'n ski, Adam Grabowski, Artur Korni owicz, Roman Matuszewski, Adam Naumowicz, and Karol P a k. The role of the Mizar mathematical library for interactive proof development in Mizar . Journal of Automated Reasoning, 61 0 (1--4): 0 9--32, 2018. doi:10.1007/s10817-017-9440-6

  11. [11]

    Interactive Theorem Proving and Program Development: Coq'Art: The Calculus of Inductive Constructions

    Yves Bertot and Pierre Cast \'e ran. Interactive Theorem Proving and Program Development: Coq'Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer Science. Springer, 2004. ISBN 978-3-540-20854-9. doi:10.1007/978-3-662-07964-5

  12. [12]

    Bors-ng: A merge bot for GitHub pull requests

    bors-ng contributors . Bors-ng: A merge bot for GitHub pull requests. https://github.com/bors-ng/bors-ng, 2017. GitHub repository

  13. [13]

    A brief overview of Agda -- a functional language with dependent types

    Ana Bove, Peter Dybjer, and Ulf Norell. A brief overview of Agda -- a functional language with dependent types. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 73--78. Springer, 2009. doi:10.1007/978-3-642-03359-9_6

  14. [14]

    Loogle: Lean search engine, 2023

    Joachim Breitner. Loogle: Lean search engine, 2023. https://loogle.lean-lang.org

  15. [15]

    Fel's conjecture on syzygies of numerical semigroups

    Evan Chen, Chris Cummins, Dejan Grubisic, Leopold Haller, Letong Hong, Andranik Kurghinyan, Kenny Lau, Hugh Leather, Seewoo Lee, Aram Markosyan, et al. Fel's conjecture on syzygies of numerical semigroups. arXiv preprint arXiv:2602.03716, 2026

  16. [16]

    Seed-Prover : Deep and broad reasoning for automated theorem proving

    Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, et al. Seed-Prover : Deep and broad reasoning for automated theorem proving. arXiv preprint arXiv:2507.23726, 2025

  17. [17]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  18. [18]

    The Lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In International Conference on Automated Deduction (CADE), 2021

  19. [19]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    DeepSeek-AI . DeepSeek-Prover-V2 : Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801, 2025

  20. [20]

    lean-lsp-mcp: MCP server for the Lean language server, 2025

    Oliver Dressler. lean-lsp-mcp: MCP server for the Lean language server, 2025. https://github.com/oOo0oOo/lean-lsp-mcp

  21. [21]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z

    Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis H...

  22. [22]

    Abel: Sample efficient online reinforcement learning for neural theorem proving

    Fabian Gloeckle, Jannis Limperg, Gabriel Synnaeve, and Amaury Hayat. Abel: Sample efficient online reinforcement learning for neural theorem proving. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24, 2024

  23. [23]

    Automatic Textbook Formalization

    Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, and Amaury Hayat. Automatic textbook formalization. arXiv preprint arXiv:2604.03071, 2026

  24. [24]

    Gemini 3.1 pro model card

    Google DeepMind . Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026

  25. [25]

    Mizar in a nutshell

    Adam Grabowski, Artur Korni owicz, and Adam Naumowicz. Mizar in a nutshell. Journal of Formalized Reasoning, 3 0 (2): 0 153--245, 2010. doi:10.6092/issn.1972-5787/1980

  26. [26]

    Progress in Formalizing Sphere Packing in Dimension 8

    Sidharth Hariharan, Christopher Birkbeck, Seewoo Lee, Ho Kiu Gareth Ma, Bhavik Mehta, Auguste Poiroux, and Maryna Viazovska. A milestone in formalization: The sphere packing problem in dimension 8. arXiv preprint, 2026. https://arxiv.org/abs/2604.23468

  27. [27]

    HOL Light : An overview

    John Harrison. HOL Light : An overview. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 60--66. Springer, 2009. doi:10.1007/978-3-642-03359-9_4

  28. [28]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  29. [29]

    MetaGPT : Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, et al. MetaGPT : Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024

  30. [30]

    Olympiad-level formal mathematical reasoning with reinforcement learning

    Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z. Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 651: 0 607--613, 2025. doi:10.1038/s41586-025-09833-y

  31. [31]

    AI for Mathematics: Progress, Challenges, and Prospects

    Haocheng Ju and Bin Dong. AI for mathematics: Progress, challenges, and prospects. arXiv preprint arXiv:2601.13209, 2026

  32. [32]

    Hypertree proof search for neural theorem proving

    Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. Advances in neural information processing systems, 35: 0 26337--26349, 2022

  33. [33]

    The Network Structure of Mathlib

    Xinze Li, Nanyun Peng, Simone Severini, and Patrick Shafto. The network structure of M athlib. arXiv preprint arXiv:2604.24797, 2026

  34. [34]

    Goedel-Prover : A frontier model for open-source automated theorem proving

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover : A frontier model for open-source automated theorem proving. arXiv preprint arXiv:2502.07640, 2025

  35. [35]

    Liu et al.Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics

    Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. Numina-lean-agent: An open and general agentic reasoning system for formal mathematics. arXiv preprint arXiv:2601.14027, 2026

  36. [36]

    Strong pnt

    Math Inc. Strong pnt. https://math-inc.github.io/strongpnt/, 2026. AI-generated Lean formalization of the strong Prime Number Theorem

  37. [37]

    Megill and David A

    Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Mathematical Proofs. Lulu Press, Morrisville, North Carolina, 2019. ISBN 978-0-359-70223-7. https://us.metamath.org/downloads/metamath.pdf

  38. [38]

    Lean REPL , 2024

    Scott Morrison. Lean REPL , 2024. https://github.com/leanprover-community/repl

  39. [39]

    Paulson, and Markus Wenzel

    Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002. ISBN 978-3-540-43376-7. doi:10.1007/3-540-45949-9

  40. [40]

    GPT-5.4 thinking system card

    OpenAI . GPT-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026

  41. [41]

    Generative Language Modeling for Automated Theorem Proving

    Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

  42. [42]

    Formal mathematics statement curriculum learning

    Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344, 2022

  43. [43]

    The Lean mathematical library

    The Mathlib Community . The Lean mathematical library. arXiv preprint arXiv:1910.09336, 2020

  44. [44]

    Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems, 37: 0 11545--11569, 2024

  45. [45]

    130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

    Josef Urban. 130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

  46. [46]

    Mathlib conventions

    Vasily Vilin. Mathlib conventions. https://github.com/Vilin97/mathlib-conventions, 2026. Community-maintained collection of Mathlib coding conventions

  47. [47]

    LEGO-Prover : Neural theorem proving with growing libraries

    Haiming Wang, Huajian Xin, Chuanyang Zheng, Li Lin, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jianwei Yin, Zhenguo Li, and Heng Liao. LEGO-Prover : Neural theorem proving with growing libraries. arXiv preprint arXiv:2310.00656, 2024

  48. [48]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, et al. Kimina-Prover preview: Towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354, 2025

  49. [49]

    M2F : Automated formalization of mathematical literature at scale, 2026

    Zichen Wang, Wanli Ma, Zhenyu Ming, Gong Zhang, Kun Yuan, and Zaiwen Wen. M2F : Automated formalization of mathematical literature at scale, 2026

  50. [50]

    Jiang, Wenda Li, Markus N

    Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  51. [51]

    DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data

    Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data. In International Conference on Machine Learning (ICML), 2024

  52. [52]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE -agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  53. [53]

    Lean workbook: A large-scale Lean problem set formalized from natural language math problems

    Huaiyuan Ying, Zijian Shi, Zhengying He, Bohan Gao, Chenyang Chen, Zhicheng Yu, Songrun Song, Qicheng Fan, Yinya Li, Lin Li, et al. Lean workbook: A large-scale Lean problem set formalized from natural language math problems. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  54. [54]

    mini F2F : A cross-system benchmark for formal O lympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. mini F2F : A cross-system benchmark for formal O lympiad-level mathematics. In International Conference on Learning Representations (ICLR), 2022

  55. [55]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM -as-a-judge with MT-Bench and Chatbot Arena . In Advances in Neural Information Processing Systems (NeurIPS), 2023