arxiv: 2602.24273 · v3 · submitted 2026-02-27 · 💻 cs.AI

Recognition: no theorem link

A Minimal Agent for Automated Theorem Proving

Borja Requena , Austin Letson , Krystian Nowakowski , Izan Beltran-Ferreiro , Leopoldo Sarra

Authors on Pith no claims yet

Pith reviewed 2026-05-15 18:42 UTC · model grok-4.3

classification 💻 cs.AI

keywords automated theorem provingAI agentsiterative refinementlibrary searchcontext managementlanguage modelsbaseline architecturecost efficiency

0 comments

The pith

A minimal agent achieves competitive theorem proving performance with iterative refinement, library search, and context management at a fraction of state-of-the-art costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a minimal agentic baseline for automated theorem proving that implements only the core shared features of advanced systems: iterative proof refinement, library search, and context management. This design is tested across qualitatively different benchmarks with various frontier language models. Results show performance comparable to more complex approaches while using a significantly simpler architecture and much lower computational resources. The work also finds consistent gains from the iterative process over single-shot generation in sample efficiency and cost.

Core claim

The paper establishes that a minimal agent implementing iterative proof refinement, library search, and context management delivers competitive theorem-proving results on diverse benchmarks. This holds while employing a simpler architecture and a fraction of the cost of current state-of-the-art systems. The iterative approach further demonstrates clear advantages over single-shot methods in sample efficiency and cost effectiveness.

What carries the argument

The minimal agentic baseline coordinating iterative refinement of partial proofs, search over a library of theorems, and dynamic context management to guide language model generation.

Load-bearing premise

The core features of iterative refinement, library search, and context management suffice to achieve competitive results across different benchmarks without additional specialized components.

What would settle it

A benchmark where the minimal agent's success rate drops substantially below leading systems or where iterative refinement shows no efficiency improvement over single-shot generation.

Figures

Figures reproduced from arXiv: 2602.24273 by Austin Letson, Borja Requena, Izan Beltran-Ferreiro, Krystian Nowakowski, Leopoldo Sarra.

**Figure 2.** Figure 2: Ablation study on different components of the minimal agent. We compare single-shot parallel calls of a general purpose LLM with an iterative approach with feedback. We show the performance as a function of the number of sampled proofs as additional elements are added to the system. Starting by adding a feedback mechanism, we then evaluate two different memory mechanisms (history of previous attempts and a… view at source ↗

**Figure 3.** Figure 3: (a) Comparison between LLMs. Claude Sonnet 4.5 and Opus 4.5 have a thinking budget of 10, 000 tokens. Both Gemini 3 Flash and Pro have thinking set to "high". The pass@k are computed from 50 independent proof samples for each model and we show the 95% confidence interval around the obtained value. For the full system, we show the mean performance over 3 full runs up to 50 iterations and show the standard e… view at source ↗

read the original abstract

We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate this agentic approach using qualitatively different benchmarks and compare various frontier language models and design choices. Our results show competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture and a fraction of their cost. Additionally, we demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's real contribution is the open-source minimal baseline, but its competitive performance claim has zero supporting numbers or benchmarks in the abstract.

read the letter

The main takeaway is that this work packages a minimal agent using iterative refinement, library search, and context management, then releases the code as a public reference baseline. That setup is genuinely useful because it gives researchers a shared, lightweight starting point instead of everyone reinventing the same core pieces from scratch. The open-source release stands out as the clearest positive step here, since it lets others inspect, run, and extend the implementation directly. The paper also makes a reasonable case that these shared features might be enough for decent results without the extra machinery in heavier systems, and it flags the efficiency gains from iteration over single-shot attempts. Those points are straightforward and worth having on record. The soft spot is straightforward too: the abstract asserts competitive performance at much lower cost but supplies no benchmark names, no pass rates, no cost numbers, no ablations, and no error analysis. Without any of that data, the central claim cannot be checked, so it is impossible to know whether the minimal design actually delivers or whether hidden details are doing the work. Since only the abstract is available, this gap might be filled in the full version, but it is the load-bearing part of the paper right now. This is the sort of paper that would interest people who build or benchmark AI theorem provers and want a reproducible reference to test against. A reader looking for a simple, accessible prover to run experiments on would get practical value from the code release. I would send it to peer review once the full results and evaluation details are included, because the baseline idea itself is worth a careful look even if the numbers need close scrutiny.

Referee Report

1 major / 1 minor

Summary. The paper proposes a minimal agentic baseline for automated theorem proving implementing core features shared by SOTA systems: iterative proof refinement, library search, and context management. It evaluates the approach on qualitatively different benchmarks with various frontier language models, claiming competitive performance to state-of-the-art methods using a simpler architecture and a fraction of the cost, plus consistent advantages of iterative over single-shot generation in sample efficiency and cost-effectiveness. The implementation is released open-source.

Significance. If the central claims hold, the work would be significant by supplying a simple, reproducible open-source baseline that enables systematic comparisons across AI theorem-proving architectures and demonstrates the value of iterative refinement for efficiency. The open-source release is a clear strength for community use and future research.

major comments (1)

[Abstract] Abstract: the claim that 'our results show competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture and a fraction of their cost' is presented without any benchmark names, pass rates, cost figures, ablation data, or error analysis. This absence is load-bearing because the central thesis is that the three core features alone suffice; without the numbers it is impossible to verify whether the minimal design actually delivers the stated outcome or whether unreported details are responsible.

minor comments (1)

[Abstract] Abstract: 'qualitatively different benchmarks' is imprecise; naming the specific benchmarks (e.g., miniF2F, ProofNet) would improve clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the single major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'our results show competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture and a fraction of their cost' is presented without any benchmark names, pass rates, cost figures, ablation data, or error analysis. This absence is load-bearing because the central thesis is that the three core features alone suffice; without the numbers it is impossible to verify whether the minimal design actually delivers the stated outcome or whether unreported details are responsible.

Authors: We agree that the abstract would benefit from greater specificity to substantiate the central claims. In the revised manuscript we will update the abstract to name the primary benchmarks (MiniF2F and ProofNet), report the key pass rates achieved by the minimal agent versus the cited state-of-the-art baselines, and include approximate cost ratios demonstrating the claimed reduction. These figures are already present in the experimental sections of the full paper; the revision will simply preview them in the abstract while preserving its length and readability. This change directly addresses the referee's concern that the abstract must stand on its own to support the thesis that the three core features suffice. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks

full rationale

The available text (abstract only) contains no derivation chain, equations, fitted parameters, or self-citations. The central claim of competitive performance is framed as an empirical comparison against independent state-of-the-art systems on external benchmarks, with an open-source release for verification. No self-definitional loops, predictions that reduce to inputs by construction, or load-bearing self-citations appear. The evaluation is therefore self-contained against outside standards rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper describes an empirical agent architecture without introducing new mathematical axioms, free parameters, or invented entities; it relies on standard LLM capabilities and existing theorem proving libraries.

pith-pipeline@v0.9.0 · 5393 in / 928 out tokens · 49371 ms · 2026-05-15T18:42:05.188965+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 7 internal anchors

[1]

Density hypercubes, higher order inter- ference and hyper-decoherence: A categorical approach

Leonardo de Moura and Sebastian Ullrich. “The Lean 4 Theorem Prover and Program- ming Language”. In:Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings. Berlin, Heidelberg: Springer-Verlag, 2021, pp. 625–635.isbn: 978-3-030-79875-8.doi:10.1007/978-3-030- 79876-5_37.url:https://d...

work page doi:10.1007/978-3-030- 2021
[2]

Lars Becker et al.A blueprint for the formalization of Carleson’s theorem on convergence of Fourier series. 2025. arXiv:2405.06423 [math.CA] .url: https://arxiv.org/abs/ 2405.06423

work page arXiv 2025
[3]

Johan Commelin and Adam Topaz.Abstraction boundaries and spec driven development in pure mathematics. 2023. arXiv:2309.14870 [math.HO].url: https://arxiv.org/abs/ 2309.14870

work page arXiv 2023
[4]

Olympiad-level formal mathematical reasoning with reinforcement learning

Thomas Hubert et al. “Olympiad-level formal mathematical reasoning with reinforcement learning”. In:Nature(Nov. 2025).doi: 10 . 1038 / s41586 - 025 - 09833 - y.url: https : //doi.org/10.1038/s41586-025-09833-y

work page doi:10.1038/s41586-025-09833-y 2025
[5]

Tudor Achim et al.Aristotle: IMO-level Automated Theorem Proving. 2025. arXiv:2510. 01346 [cs.AI].url:https://arxiv.org/abs/2510.01346

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

HepLean: Digitalising high energy physics

Joseph Tooby-Smith. “HepLean: Digitalising high energy physics”. In:Computer Physics Communications308 (2025), p. 109457.issn: 0010-4655.doi: https://doi.org/10.1016/ j.cpc.2024.109457 .url: https://www.sciencedirect.com/science/article/pii/ S0010465524003801

work page arXiv 2025
[7]

Gergely Székely.First-Order Logic Investigation of Relativity Theory with an Emphasis on Accelerated Observers. 2010. arXiv:1005.0973 [gr-qc].url: https://arxiv.org/abs/ 1005.0973

work page internal anchor Pith review Pith/arXiv arXiv 2010
[8]

Towards formal proofs of feedback control theory

Omar A. Jasim and Sandor M. Veres. “Towards formal proofs of feedback control theory”. In: 2017 21st International Conference on System Theory, Control and Computing (ICSTCC). 2017, pp. 43–48.doi:10.1109/ICSTCC.2017.8107009

work page doi:10.1109/icstcc.2017.8107009 2017
[9]

Formal Analysis of Linear Control Systems Using Theorem Proving

Adnan Rashid and Osman Hasan. “Formal Analysis of Linear Control Systems Using Theorem Proving”. In:Formal Methods and Software Engineering. Ed. by Zhenhua Duan and Luke Ong. Cham: Springer International Publishing, 2017, pp. 345–361.isbn: 978-3- 319-68690-5

work page 2017
[10]

A formally certified end-to-end implementation of Shor’s factorization algorithm

Yuxiang Peng et al. “A formally certified end-to-end implementation of Shor’s factorization algorithm”. In:Proceedings of the National Academy of Sciences120.21 (2023), e2218775120. doi: 10.1073/pnas.2218775120 . eprint: https://www.pnas.org/doi/pdf/10.1073/ pnas.2218775120.url:https://www.pnas.org/doi/abs/10.1073/pnas.2218775120

work page doi:10.1073/pnas.2218775120 2023
[11]

Formal Verification of Quantum Programs: Theory, Tools, and Challenges

Marco Lewis, Sadegh Soudjani, and Paolo Zuliani. “Formal Verification of Quantum Programs: Theory, Tools, and Challenges”. In:ACM Transactions on Quantum Computing 5.1 (Dec. 2023).doi:10.1145/3624483.url:https://doi.org/10.1145/3624483

work page doi:10.1145/3624483.url:https://doi.org/10.1145/3624483 2023
[12]

Jianqiao Lu et al.Process-Driven Autoformalization in Lean 4. 2024. arXiv:2406.01940 [cs.CL].url:https://arxiv.org/abs/2406.01940

work page arXiv 2024
[13]

AugustePoirouxetal.Reliable Evaluation and Benchmarks for Statement Autoformalization

work page
[14]

arXiv:2406.07222 [cs.CL].url:https://arxiv.org/abs/2406.07222

work page arXiv
[15]

Sumanth Varambally et al.Hilbert: Recursively Building Formal Proofs with Informal Reasoning. 2025. arXiv:2509.22819 [cs.AI].url: https://arxiv.org/abs/2509.22819

work page arXiv 2025
[16]

Z. Z. Ren et al.DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition. 2025. arXiv:2504.21801 [cs.CL].url: https://arxiv.org/abs/2504.21801. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Luoxin Chen et al.Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

work page
[18]

arXiv:2507.23726 [cs.AI].url:https://arxiv.org/abs/2507.23726

work page arXiv
[19]

George Tsoukalas et al.PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition. 2024. arXiv:2407.11214 [cs.AI].url: https://arxiv.org/ abs/2407.11214

work page arXiv 2024
[20]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. “MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics”. In:arXiv preprint arXiv:2109.00110(2021)

work page arXiv 2021
[21]

arXiv:2603.02668 [cs.AI].url:https://arxiv.org/abs/2603.02668

Austin Letson et al.SorryDB: Can AI Provers Complete Real-World Lean Theorems?2026. arXiv:2603.02668 [cs.AI].url:https://arxiv.org/abs/2603.02668

work page arXiv 2026
[22]

Yong Lin et al.Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction. 2025. arXiv:2508.03613 [cs.LG].url: https://arxiv. org/abs/2508.03613

work page arXiv 2025
[23]

Haiming Wang et al.Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning. 2025. arXiv:2504.11354 [cs.AI] .url: https://arxiv. org/abs/2504.11354

work page arXiv 2025
[24]

Kefan Dong and Tengyu Ma.STP: Self-play LLM Theorem Provers with Iterative Con- jecturing and Proving. Mar. 2025.doi:10.48550/arXiv.2502.00212. arXiv: 2502.00212 [cs]. (Visited on 01/22/2026)

work page doi:10.48550/arxiv.2502.00212 2025
[25]

Ran Xin et al.BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving. Oct. 2025.doi:10.48550/arXiv.2502.03438. arXiv: 2502.03438 [cs]. (Visited on 01/22/2026)

work page doi:10.48550/arxiv.2502.03438 2025
[26]

Jiangjie Chen et al.Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience. 2025. arXiv:2512.17260 [cs.CL].url: https://arxiv.org/ abs/2512.17260

work page arXiv 2025
[27]

https: //doi.org/10.1145/3372885.3373824 .https://doi.org/10.1145/3372885.3373824 15

The mathlib Community. “The lean mathematical library”. In:Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. POPL ’20. ACM, Jan. 2020, pp. 367–381.doi:10.1145/3372885.3373824.url: http://dx.doi.org/10. 1145/3372885.3373824

work page doi:10.1145/3372885.3373824.url: 2020
[28]

Ruida Wang et al.MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving. 2025. arXiv:2503.03205 [cs.CL] .url: https://arxiv.org/abs/2503.03205

work page arXiv 2025
[29]

2023.doi: 10.48550/arXiv.2306.15626

Kaiyu Yang et al.LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. 2023.doi: 10.48550/arXiv.2306.15626. arXiv: 2306.15626 [cs].url: http://arxiv. org/abs/2306.15626(visited on 01/23/2026). Pre-published

work page doi:10.48550/arxiv.2306.15626 2023
[30]

2025.doi:10.48550/arXiv.2412.20735

Yang Li et al.HunyuanProver: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving. 2025.doi:10.48550/arXiv.2412.20735. arXiv: 2412.20735 [cs] .url: http://arxiv.org/abs/2412.20735 (visited on 01/23/2026). Pre-published

work page doi:10.48550/arxiv.2412.20735 2025
[31]

Ziju Shen et al.REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reason- ing. Nov. 2025.doi: 10.48550/arXiv.2505.20613. arXiv: 2505.20613 [cs]. (Visited on 01/22/2026)

work page doi:10.48550/arxiv.2505.20613 2025
[32]

DT-Solver:AutomatedTheoremProvingwithDynamic-TreeSampling Guided by Proof-level Value Function

HaimingWangetal.“DT-Solver:AutomatedTheoremProvingwithDynamic-TreeSampling Guided by Proof-level Value Function”. In:Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki. Toronto, Canada: Association for Computational Linguistics, July 20...

work page doi:10.18653/v1/2023.acl-long.706 2023
[33]

May 2022

Guillaume Lample et al.HyperTree Proof Search for Neural Theorem Proving. May 2022. doi:10.48550/arXiv.2205.11491. arXiv:2205.11491 [cs]. (Visited on 01/23/2026)

work page doi:10.48550/arxiv.2205.11491 2022
[34]

Yong Lin et al.Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving. 2025. arXiv:2502.07640 [cs.LG].url:https://arxiv.org/abs/2502.07640

work page arXiv 2025
[35]

Benjamin Breen et al.Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics. Nov. 2025.doi:10.48550/arXiv.2510. 12787. arXiv:2510.12787 [cs]. (Visited on 01/22/2026)

work page doi:10.48550/arxiv.2510 2025
[36]

Kaito Baba et al.Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs. Oct. 2025.doi: 10.48550/arXiv.2506.19923. arXiv: 2506.19923 [cs]. (Visited on 01/22/2026)

work page doi:10.48550/arxiv.2506.19923 2025
[37]

Chenrui Cao et al.Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models. 2025. arXiv:2506.11487 [cs.AI].url:https://arxiv.org/abs/2506.11487

work page arXiv 2025
[38]

GeorgeTsoukalasetal.PutnamBench Leaderboard.https://trishullab.github.io/PutnamBench/leaderboard.html

work page
[39]

(Visited on 01/23/2026)

work page 2026
[40]

Huajian Xin et al.DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Rein- forcement Learning and Monte-Carlo Tree Search. 2024. arXiv:2408.08152 [cs.CL].url: https://arxiv.org/abs/2408.08152

work page arXiv 2024
[41]

Jiang et al.Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

Albert Q. Jiang et al.Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs. 2023. arXiv:2210.12283 [cs.AI].url: https://arxiv.org/abs/2210. 12283

work page arXiv 2023
[42]

2026.url: https : / / www

Logical Intelligence.Aleph Prover by Logical Intelligence. 2026.url: https : / / www . logicalintelligence.com/aleph-prover.html(visited on 01/20/2026)

work page 2026
[43]

Jiedong Jiang et al.FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels. 2025. arXiv:2511.02872 [cs.LG].url: https://arxiv.org/abs/2511. 02872

work page arXiv 2025
[44]

Rongge Xu et al.LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories). 2025. arXiv:2512.24796 [cs.LO].url: https://arxiv.org/abs/2512. 24796

work page arXiv 2025
[45]

Shunyu Yao et al.ReAct: Synergizing Reasoning and Acting in Language Models. 2023. arXiv:2210.03629 [cs.CL].url:https://arxiv.org/abs/2210.03629

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

A Semantic Search Engine for Mathlib4

Guoxiong Gao et al. “A Semantic Search Engine for Mathlib4”. In:Findings of the As- sociation for Computational Linguistics: EMNLP 2024. Ed. by Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Miami, Florida, USA: Association for Computational Lin- guistics, Nov. 2024, pp. 8001–8013.doi:10.18653/v1/2024.findings-emnlp.470.url: https://aclanthology.org/2...

work page doi:10.18653/v1/2024.findings-emnlp.470.url: 2024
[47]

Accessed: 2026-01-23

Tavily.Tavily Search API.https://www.tavily.com. Accessed: 2026-01-23. 2025

work page 2026
[48]

2025.url:https://github.com/augustepoiroux/LeanInteract

Auguste Poiroux, Viktor Kuncak, and Antoine Bosselut.LeanInteract: A Python Interface for Lean 4. 2025.url:https://github.com/augustepoiroux/LeanInteract

work page 2025
[49]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn et al. “Reflexion: language agents with verbal reinforcement learning”. In: Advances in Neural Information Processing Systems. Ed. by A. Oh et al. Vol. 36. Curran Associates, Inc., 2023, pp. 8634–8652.url:https://proceedings.neurips.cc/paper_ files / paper / 2023 / file / 1b44b878bb782e6954cd888628510e90 - Paper - Conference . pdf

work page 2023
[50]

Liangming Pan et al.Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. 2023. arXiv:2308 . 03188 [cs.CL].url: https://arxiv.org/abs/2308.03188. 12

work page arXiv 2023
[51]

Matthew Renze and Erhan Guven.Self-Reflection in LLM Agents: Effects on Problem- Solving Performance. 2024. arXiv:2405.06682 [cs.CL].url: https://arxiv.org/abs/ 2405.06682

work page arXiv 2024
[52]

GitHub repository

GasStationManager and Paul Lezeau.SafeVerify: A Lean4 script for robustly verifying submitted proofs of theorems and implementations of functions.https://github.com/ GasStationManager/SafeVerify. GitHub repository. License: Apache-2.0. Accessed 2026- 01-19. 2026

work page 2026
[53]

https://github.com/leanprover/lean4checker

leanprover contributors.lean4checker: Replay the Environment for a given Lean module. https://github.com/leanprover/lean4checker. GitHub repository. License: Apache-2.0. Accessed 2026-01-19. 2026

work page 2026
[54]

Matthieu Zimmer et al.Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving. 2025. arXiv:2507.02726 [cs.AI].url:https://arxiv.org/abs/2507.02726

work page arXiv 2025
[55]

ABEL: Sample Efficient Online Reinforcement Learning for Neural Theorem Proving

Fabian Gloeckle et al. “ABEL: Sample Efficient Online Reinforcement Learning for Neural Theorem Proving”. In:The 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24. 2024.url:https://openreview.net/forum?id=kk3mSjVCUO

work page 2024
[56]

Zijian Wu et al.InternLM2.5-StepProver: Advancing Automated Theorem Proving via Critic-Guided Search. 2025. arXiv:2410.15700 [cs.AI].url: https://arxiv.org/abs/ 2410.15700

work page arXiv 2025
[57]

Zheng Cai et al.InternLM2 Technical Report. 2024. arXiv:2403.17297 [cs.CL] .url: https://arxiv.org/abs/2403.17297

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Gheorghe Comanici et al.Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025. arXiv: 2507.06261 [cs.CL].url:https://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

Daya Guo et al. “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning”. In:Nature645.8081 (Sept. 2025), pp. 633–638.issn: 1476-4687.doi:10.1038/ s41586-025-09422-z.url:http://dx.doi.org/10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z 2025
[60]

Amitayush Thakur et al.An In-Context Learning Agent for Formal Theorem-Proving. 2024. arXiv:2310.04353 [cs.LG].url:https://arxiv.org/abs/2310.04353

work page arXiv 2024
[61]

GPT-4o System Card

OpenAI et al.GPT-4o System Card. 2024. arXiv: 2410 . 21276 [cs.CL].url: https : //arxiv.org/abs/2410.21276. A Ablation dataset In Table 2, we report the list of100problems randomly extracted from the Putnam competition to run our ablation studies. B Prompts This section contains the prompts used in our experiments. Every block in the agent architecture de...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

**Proof Development Strategy** - Use appropriate Lean 4 tactics to make progress - Leverage the build feedback and experience to inform the next action and make steady progress towards the final proof - The proof must eventually be complete and fully build without any ‘sorry‘ statements

work page
[64]

Mathlib.Topology.Basic

**Quality Standards** - Use appropriate Lean 4 tactics and syntax - Reference Mathlib lemmas when applicable - Add imports as needed for any lemmas or tactics you use </requirements> <output-format> Return structured output with these fields. In all cases, please produce all the fields in your output: **reasoning**: Think carefully about the theorem at ha...

work page
[65]

**Preserve Theorem Statement** - NEVER modify the theorem signature, name or type - Only change the proof body (after :=) - Work ONLY on the target theorem

work page
[66]

**Proof Development Strategy** - Use appropriate Lean 4 tactics to make progress - The proof must be complete and valid lean4 proof and fully 17 successfully without any ‘sorry‘ or ‘admit‘ statements, nor introducing new axioms

work page
[67]

Mathlib.Topology.Basic

**Quality Standards** - Use appropriate Lean 4 tactics and syntax - Reference Mathlib lemmas when applicable - Add imports as needed for any lemmas or tactics you use </requirements> <output-format> Return structured output with these fields. In all cases, please produce all the fields in your output: **reasoning**: Think carefully about the theorem at ha...

work page
[68]

$(a + b)^2 = a^2 + b^2$

work page
[69]

Given that $ab = -ba$, we have $ab + ba = ab - ab = 0$

$(a - b)^2 = a^2 + b^2$ In a ring, $(a + b)^2 = a^2 + ab + ba + b^2$. Given that $ab = -ba$, we have $ab + ba = ab - ab = 0$. So $(a + b)^2 = a^2 + b^2$. Similarly, $(a - b)^2 = a^2 - ab - ba + b^2 = a^2 - ab + ab + b^2 = a^2 + b^2$ (since $-ba = ab$). 23 Let me think about how to prove this in Lean 4. I'll need to expand the squares and use the hypothesi...

work page
[70]

The`ring_nf`tactic doesn't help here because the ring is non-commutative (it's a general Ring, not a CommRing), so`ring` and`ring_nf`tactics won't work as expected for simplification

work page
[71]

The issue is that in a non-commutative ring, the`ring`tactic assumes commutativity and won't help

After`ring_nf`, the goal remains`(a + b) ^ 2 = a ^ 2 + b ^ 2`, so the rewrite fails because the pattern`a * b + b * a`doesn't appear in the unexpanded form. The issue is that in a non-commutative ring, the`ring`tactic assumes commutativity and won't help. The prover needs to manually expand the squares and use the hypothesis directly. Key lessons: 1.`ring...

work page
[72]

The prover needs to manually expand`(a + b)^2`and`(a - b)^2` and then simplify using the hypothesis

work page
[73]

Square expansion:`x^2 = x * x`, then `(a+b)*(a+b) = a*a + a*b + b*a + b*b`

work page
[74]

Let me write a concise context for the prover

Need to use lemmas like`add_mul`,`mul_add`,`pow_two`or`sq` to expand, then use the hypothesis that`a*b = -(b*a)`to simplify`a*b + b*a = 0`. Let me write a concise context for the prover. 28 Experience context Key lessons from this attempt:

work page
[75]

The`ring`family of tactics assumes commutativity, so they cannot simplify expressions involving`a * b`and`b * a`as separate terms

**`ring`and`ring_nf`don't work for non-commutative rings**: The theorem is about a general`Ring R`, which is non-commutative. The`ring`family of tactics assumes commutativity, so they cannot simplify expressions involving`a * b`and`b * a`as separate terms

work page
[79]

Notice that the previous context is empty in the user input

**After full expansion**,`(a + b)^2` becomes`a*a + a*b + b*a + b*b`, which equals `a^2 + (a*b + b*a) + b^2 = a^2 + 0 + b^2 = a^2 + b^2`. Notice that the previous context is empty in the user input. This is because this is the first iteration, so there is no context aggregating past experience yet. In the following iteration, this agent’s output would fill...

work page
[80]

The`ring`family of tactics assumes commutativity, so they 31 cannot simplify expressions involving`a * b`and`b * a`as separate terms

**`ring`and`ring_nf`don't work for non-commutative rings**: The theorem is about a general`Ring R`, which is non-commutative. The`ring`family of tactics assumes commutativity, so they 31 cannot simplify expressions involving`a * b`and`b * a`as separate terms

work page
[81]

**Manual expansion is required**: Since`ring_nf`doesn't expand the goal, the pattern`a * b + b * a`never appears in the expression`(a + b) ^ 2 = a ^ 2 + b ^ 2`, causing the rewrite to fail

work page
[82]

**Correct approach**: Need to manually expand the squares using lemmas like: -`pow_two x`or`sq x`to rewrite`x^2`as`x * x` -`add_mul`,`mul_add`to distribute multiplication over addition -`sub_mul`,`mul_sub`for the subtraction case

work page
[83]

From`h : a * b = -(b * a)`, we get`a * b + b * a = -(b * a) + b * a = 0`

**The helper lemma`hab : a * b + b * a = 0`is correct** and will be useful once the expressions are properly expanded. From`h : a * b = -(b * a)`, we get`a * b + b * a = -(b * a) + b * a = 0`

work page
[84]

</experience> Agent reasoning Let me analyze the problem more carefully

**After full expansion**,`(a + b)^2` becomes`a*a + a*b + b*a + b*b`, which equals `a^2 + (a*b + b*a) + b^2 = a^2 + 0 + b^2 = a^2 + b^2`. </experience> Agent reasoning Let me analyze the problem more carefully. The feedback tells me that`ring`and`ring_nf`don't work because the ring is non- commutative. I need to manually expand the squares using basic lemm...

work page

Showing first 80 references.