ImProver: Agent-Based Automated Proof Optimization

Jeremy Avigad; Prasad Tetali; Riyaz Ahuja; Sean Welleck

arxiv: 2410.04753 · v2 · pith:AZAG5ORXnew · submitted 2024-10-07 · 💻 cs.AI · cs.CL· cs.LG· cs.LO

ImProver: Agent-Based Automated Proof Optimization

Riyaz Ahuja , Jeremy Avigad , Prasad Tetali , Sean Welleck This is my paper

Pith reviewed 2026-05-23 19:47 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LGcs.LO

keywords automated proof optimizationLLM agentsLean theorem proverproof rewritingformal mathematicsChain-of-Stateserror correction

0 comments

The pith

An LLM agent rewrites Lean proofs to make them shorter, more modular, and more readable while remaining correct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines automated proof optimization as the task of rewriting an existing correct proof in Lean to improve an arbitrary user-specified criterion such as length or readability. ImProver is presented as the first method for this task, an LLM-based agent that incorporates symbolic Lean context through a Chain-of-States technique plus error correction and retrieval steps. Experiments apply the agent to real undergraduate, competition, and research-level theorems and report that the resulting proofs are substantially shorter, more modular, and more readable. The work matters because human-written proofs are often not optimal for machine learning or specific downstream applications, and manual refinement is costly.

Core claim

ImProver shows that an LLM agent, equipped with a Chain-of-States technique that maintains symbolic Lean context, along with error-correction and retrieval, can rewrite proofs of real-world theorems so that they remain accepted by Lean yet become substantially shorter, more modular, and more readable.

What carries the argument

The ImProver agent, which uses a Chain-of-States technique to incorporate symbolic Lean context for guiding iterative proof rewrites.

If this is right

Proofs can be automatically adapted to particular styles or criteria without manual rewriting.
Training data for proof-generation models can be improved by replacing human proofs with optimized versions.
Formal libraries become easier to maintain and extend when proofs are made more modular.
Downstream tasks that consume formal proofs, such as explanation or extraction, gain from clearer structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent architecture could be ported to other interactive theorem provers if equivalent symbolic context can be extracted.
Repeated optimization passes might produce even shorter proofs if the process is iterated until no further gains occur.
Optimized proofs could serve as synthetic data to fine-tune smaller models for proof generation.

Load-bearing premise

An LLM agent can reliably generate rewrites that improve the chosen metric, stay logically equivalent to the original proof, and pass Lean checking without introducing undetected errors.

What would settle it

Apply ImProver to a held-out collection of theorems and verify whether every output proof type-checks in Lean and preserves the original statement when reviewed by a human expert.

Figures

Figures reproduced from arXiv: 2410.04753 by Jeremy Avigad, Prasad Tetali, Riyaz Ahuja, Sean Welleck.

**Figure 1.** Figure 1: ImProver automatically rewrites formal proofs to optimize a criterion such as length or readability while remaining correct. In this example, ImProver optimizes a human-written lemma (right) from the 2022 International Math Olympiad (Question 2, solution from Compfiles (David Renshaw, 2024)) for length. ImProver’s optimized proof is correct and more concise. 2 RELATED WORK Recently there has been wide int… view at source ↗

**Figure 2.** Figure 2: A Lean proof (left) with Chain-of-States promptin [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Optimizing a group-theoretic result from MIL Chap [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Solving a group theorem exercise from MIL Chapter 8 [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Optimizing a lemma from IMO 2019 P1 for readability [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Optimizing a lemma from the solutions of MIL CH08 S0 [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Optimizing a lemma from MIL CH04 S01 solution for le [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Optimizing a theorem from Mathlib/FundamentalGroupoid/InducedMapsfor length After optimizing for readability, we see that the model did not change the structure of the proof. Rather, it added an intermediate declaration so that users can better understand the state after the convert. This intermediate tactic greatly helps in the understandability and clarity of the proof. Original (human-written) /-- Anoth… view at source ↗

**Figure 9.** Figure 9: Optimizing a theorem from Mathlib/FundamentalGroupoid/SimplyConnected for readability 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

read the original abstract

Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proofs assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ImProver adds a Chain-of-States agent with live Lean context for proof rewriting, but the abstract gives no numbers to back the claimed gains in length or readability.

read the letter

The core contribution is a new agent setup for rewriting existing Lean proofs to hit user metrics like shortness or modularity. They combine Chain-of-States (which keeps symbolic context live) with error correction and retrieval, and they test it on undergrad, contest, and research theorems. That moves past plain proof generation, which is a useful step for library maintenance and training data work. The paper is honest that naive LLM calls do not cut it, and the listed fixes address some of the obvious failure modes. Credit for shipping a concrete system that actually runs on real theorems rather than toy examples. The main gap is evidence. The abstract says the rewrites are substantially better but supplies no tables, baselines, or failure breakdowns, so it is impossible to judge effect size or whether the gains survive human review for actual readability. The stress-test point on equivalence and downstream usability is fair; Lean acceptance alone does not guarantee the output stays modular or avoids subtle style regressions that matter for education or further automation. If the full paper has controlled comparisons and human ratings, that would close the loop. This is aimed at the automated theorem proving and Lean tooling crowd. Readers already building agents or maintaining math libraries will get the most out of the architecture details. The work is coherent enough on its own terms to deserve referee time, even if the experiments need tightening.

Referee Report

2 major / 1 minor

Summary. The paper introduces ImProver, an LLM-based agent system for automated proof optimization in Lean. It rewrites existing proofs to optimize arbitrary user-specified metrics (e.g., length, modularity, readability) while preserving logical correctness. The method augments naive LLM prompting with a Chain-of-States technique that maintains symbolic Lean context, plus error-correction and retrieval mechanisms. Experiments are reported on undergraduate, competition, and research-level theorems, with the central claim that ImProver produces substantially shorter, more modular, and more readable proofs than naive application.

Significance. If the empirical claims hold under rigorous evaluation, the work would be a useful contribution to formal mathematics tooling by demonstrating an agent architecture that can target downstream criteria beyond mere provability. The integration of symbolic context via Chain-of-States and the focus on optimization rather than generation alone are concrete strengths; however, the absence of quantitative metrics in the abstract leaves the practical impact difficult to gauge.

major comments (2)

[Abstract] Abstract: the assertion that ImProver produces 'substantially shorter, more modular, and more readable' proofs is presented without any numerical results, baselines, or evaluation protocol. This is load-bearing for the central empirical claim and prevents assessment of whether the reported improvements are real or merely anecdotal.
[Evaluation] Evaluation section (inferred from abstract description): no quantitative breakdown is supplied of failure modes, how logical equivalence is verified beyond Lean acceptance, or whether gains in modularity/readability survive human review or affect downstream tasks. This directly bears on the reliability assumption that the agent avoids subtle degradations.

minor comments (1)

[Abstract] The abstract states that 'naively applying LLMs to proof optimization falls short' but does not specify the exact failure modes observed or the ablation that isolates the contribution of Chain-of-States versus error-correction versus retrieval.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important aspects of clarity and evaluation rigor. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that ImProver produces 'substantially shorter, more modular, and more readable' proofs is presented without any numerical results, baselines, or evaluation protocol. This is load-bearing for the central empirical claim and prevents assessment of whether the reported improvements are real or merely anecdotal.

Authors: We agree that the abstract would benefit from quantitative support to substantiate the claims. The body of the paper reports specific metrics from the experiments (including length reductions, modularity scores, and success rates relative to naive LLM baselines). We will revise the abstract to include key numerical results and a brief reference to the evaluation protocol. revision: yes
Referee: [Evaluation] Evaluation section (inferred from abstract description): no quantitative breakdown is supplied of failure modes, how logical equivalence is verified beyond Lean acceptance, or whether gains in modularity/readability survive human review or affect downstream tasks. This directly bears on the reliability assumption that the agent avoids subtle degradations.

Authors: The evaluation section details verification via successful Lean compilation (which enforces logical equivalence) and provides quantitative breakdowns of improvements along with failure mode analysis across the theorem sets. We will expand the failure mode discussion for greater transparency. However, the work does not include human review of readability or downstream task evaluations, which are acknowledged as limitations. revision: partial

standing simulated objections not resolved

Human review of readability gains and evaluation of effects on downstream tasks were not conducted in the current study.

Circularity Check

0 steps flagged

No circularity: empirical agent demonstration with no derivation chain

full rationale

The paper describes an LLM-based agent system (ImProver) for rewriting Lean proofs to optimize user metrics, evaluated empirically on undergraduate/competition/research theorems. No equations, first-principles derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or described structure. Claims rest on experimental outcomes (shorter/more modular proofs accepted by Lean) rather than any self-referential reduction. This is a standard empirical systems paper; the derivation chain is absent, so circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The system rests on the assumption that Lean correctly validates proof equivalence; no free parameters or new entities are introduced beyond standard LLM usage.

axioms (1)

domain assumption Lean theorem prover accurately determines whether a rewritten proof is logically valid and equivalent to the original.
The entire optimization loop depends on Lean accepting the output as correct.

pith-pipeline@v0.9.0 · 5771 in / 1155 out tokens · 31042 ms · 2026-05-23T19:47:14.411351+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search
cs.LO 2026-05 unverdicted novelty 6.0

Lean Refactor uses retrieval from a curated multi-objective strategy database to guide frozen LLMs in refactoring Lean proofs, reporting over 70% token compression on benchmarks and improved version transfer.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[3]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[4]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page
[5]

AI achieves silver-medal standard solving international mathematical olympiad problems

AlphaProof and AlphaGeometry Teams. AI achieves silver-medal standard solving international mathematical olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/, 2024

work page 2024
[6]

A tactic language for declarative proofs

Serge Autexier and Dominik Dietrich. A tactic language for declarative proofs. In Matt Kaufmann and Lawrence C. Paulson (eds.), Interactive Theorem Proving, pp.\ 99--114, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg

work page 2010
[7]

The use of mmr, diversity-based reranking for reordering documents and producing summaries

Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pp.\ 335–336, New York, NY, USA, 1998. Association for Computing Machinery. ISBN 1581130155. do...

work page doi:10.1145/290941.291025 1998
[8]

Teaching Large Language Models to Self-Debug

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023. URL https://arxiv.org/abs/2304.05128

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

compfiles

David Renshaw . compfiles. https://github.com/dwrensha/compfiles, 2024

work page 2024
[10]

Rabe, Talia Ringer, and Yuriy Brun

Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models, 2023

work page 2023
[11]

Proof artifact co-training for theorem proving with language models

Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=rpxJc9j04U

work page 2022
[12]

minictx: Neural theorem proving with (long-)contexts, 2024

Jiewen Hu, Thomas Zhu, and Sean Welleck. minictx: Neural theorem proving with (long-)contexts, 2024. URL https://arxiv.org/abs/2408.03350

work page arXiv 2024
[13]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SMa9EAovKMC

work page 2023
[14]

lean-training-data

Kim Morrison . lean-training-data. https://github.com/kim-em/lean-training-data, 2024

work page 2024
[15]

Hypertree proof search for neural theorem proving

Guillaume Lample, Timothee Lacroix, Marie anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/foru...

work page 2022
[16]

mathematics\_in\_lean

leanprover-community . mathematics\_in\_lean. https://github.com/leanprover-community/mathematics\_in\_lean, 2024

work page 2024
[17]

A survey on deep learning for theorem proving, 2024

Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. A survey on deep learning for theorem proving, 2024

work page 2024
[18]

A survey of deep learning for mathematical reasoning

Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 14605--14631, Toronto, Canada, July 2023. Association for C...

work page doi:10.18653/v1/2023.acl-long.817 2023
[19]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Informat...

work page 2023
[20]

The Lean mathematical library

The mathlib Community. The lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, POPL ’20. ACM, January 2020. doi:10.1145/3372885.3373824. URL http://dx.doi.org/10.1145/3372885.3373824

work page doi:10.1145/3372885.3373824 2020
[21]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, pp.\ 625–635, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-79875-8. doi:10.1007/978-3-030-79876-5_37. URL https://doi...

work page doi:10.1007/978-3-030-79876-5_37 2021
[22]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Generative language modeling for automated theorem proving, 2020

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving, 2020

work page 2020
[24]

Formal mathematics statement curriculum learning, 2022

Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022

work page 2022
[25]

An in-context learning agent for formal theorem-proving, 2024

Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving, 2024

work page 2024
[26]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=...

work page 2022
[27]

Formal proof sketches

Freek Wiedijk. Formal proof sketches. In Stefano Berardi, Mario Coppo, and Ferruccio Damiani (eds.), Types for Proofs and Programs, pp.\ 378--393, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. ISBN 978-3-540-24849-1

work page 2004
[28]

LeanDojo : Theorem proving with retrieval-augmented language models

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo : Theorem proving with retrieval-augmented language models. In Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[29]

Xu, Zhengbao Jiang, and Graham Neubig

Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang, and Graham Neubig. Docprompting: Generating code by retrieving the docs. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ZTCxT2t2Ru

work page 2023

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page

[3] [3]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page

[4] [4]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page

[5] [5]

AI achieves silver-medal standard solving international mathematical olympiad problems

AlphaProof and AlphaGeometry Teams. AI achieves silver-medal standard solving international mathematical olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/, 2024

work page 2024

[6] [6]

A tactic language for declarative proofs

Serge Autexier and Dominik Dietrich. A tactic language for declarative proofs. In Matt Kaufmann and Lawrence C. Paulson (eds.), Interactive Theorem Proving, pp.\ 99--114, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg

work page 2010

[7] [7]

The use of mmr, diversity-based reranking for reordering documents and producing summaries

Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pp.\ 335–336, New York, NY, USA, 1998. Association for Computing Machinery. ISBN 1581130155. do...

work page doi:10.1145/290941.291025 1998

[8] [8]

Teaching Large Language Models to Self-Debug

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023. URL https://arxiv.org/abs/2304.05128

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

compfiles

David Renshaw . compfiles. https://github.com/dwrensha/compfiles, 2024

work page 2024

[10] [10]

Rabe, Talia Ringer, and Yuriy Brun

Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models, 2023

work page 2023

[11] [11]

Proof artifact co-training for theorem proving with language models

Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=rpxJc9j04U

work page 2022

[12] [12]

minictx: Neural theorem proving with (long-)contexts, 2024

Jiewen Hu, Thomas Zhu, and Sean Welleck. minictx: Neural theorem proving with (long-)contexts, 2024. URL https://arxiv.org/abs/2408.03350

work page arXiv 2024

[13] [13]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SMa9EAovKMC

work page 2023

[14] [14]

lean-training-data

Kim Morrison . lean-training-data. https://github.com/kim-em/lean-training-data, 2024

work page 2024

[15] [15]

Hypertree proof search for neural theorem proving

Guillaume Lample, Timothee Lacroix, Marie anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/foru...

work page 2022

[16] [16]

mathematics\_in\_lean

leanprover-community . mathematics\_in\_lean. https://github.com/leanprover-community/mathematics\_in\_lean, 2024

work page 2024

[17] [17]

A survey on deep learning for theorem proving, 2024

Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. A survey on deep learning for theorem proving, 2024

work page 2024

[18] [18]

A survey of deep learning for mathematical reasoning

Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 14605--14631, Toronto, Canada, July 2023. Association for C...

work page doi:10.18653/v1/2023.acl-long.817 2023

[19] [19]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Informat...

work page 2023

[20] [20]

The Lean mathematical library

The mathlib Community. The lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, POPL ’20. ACM, January 2020. doi:10.1145/3372885.3373824. URL http://dx.doi.org/10.1145/3372885.3373824

work page doi:10.1145/3372885.3373824 2020

[21] [21]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, pp.\ 625–635, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-79875-8. doi:10.1007/978-3-030-79876-5_37. URL https://doi...

work page doi:10.1007/978-3-030-79876-5_37 2021

[22] [22]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Generative language modeling for automated theorem proving, 2020

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving, 2020

work page 2020

[24] [24]

Formal mathematics statement curriculum learning, 2022

Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022

work page 2022

[25] [25]

An in-context learning agent for formal theorem-proving, 2024

Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving, 2024

work page 2024

[26] [26]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=...

work page 2022

[27] [27]

Formal proof sketches

Freek Wiedijk. Formal proof sketches. In Stefano Berardi, Mario Coppo, and Ferruccio Damiani (eds.), Types for Proofs and Programs, pp.\ 378--393, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. ISBN 978-3-540-24849-1

work page 2004

[28] [28]

LeanDojo : Theorem proving with retrieval-augmented language models

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo : Theorem proving with retrieval-augmented language models. In Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[29] [29]

Xu, Zhengbao Jiang, and Graham Neubig

Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang, and Graham Neubig. Docprompting: Generating code by retrieving the docs. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ZTCxT2t2Ru

work page 2023