ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

Jeremy Avigad; Riyaz Ahuja; Sean Welleck; Tate Rowney

arxiv: 2605.22885 · v1 · pith:2TN5JHT4new · submitted 2026-05-21 · 💻 cs.AI · cs.CL· cs.LG· cs.LO

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

Riyaz Ahuja , Tate Rowney , Jeremy Avigad , Sean Welleck This is my paper

Pith reviewed 2026-05-25 06:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LGcs.LO

keywords proof optimizationLean 4neurosymboliclanguage modelsformal mathematicsexpert iterationself-improving modelstheorem proving

0 comments

The pith

A 7B model trained via ImProver 2 outperforms much larger models at optimizing Lean 4 proofs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ImProver 2 is a neurosymbolic framework that pairs an expert-iteration training loop with a scaffold exposing both formal Lean structure and lightweight informal abstractions. The method trains models to refactor verified proofs according to new metrics that track structural properties relevant to maintainability and data quality. A resulting 7B-parameter model beats orders-of-magnitude larger models in the same family and matches mid-tier frontier models across those metrics. The scaffold itself lifts performance for both small and large models, showing that proof optimization can be treated as a learnable task.

Core claim

ImProver 2 combines data-efficient expert iteration with a neurosymbolic scaffold to train language models for automated proof optimization in Lean 4. This produces a 7B model that outperforms larger models within its family and competes with mid-tier frontier models on a suite of structural proof metrics. The same scaffold improves results across model scales, establishing that small models can restructure research-level proofs when given appropriate structure and training.

What carries the argument

The neurosymbolic scaffold exposing formal Lean structure alongside lightweight informal abstractions, used inside an expert-iteration pipeline.

If this is right

Small models can restructure complex, research-level proofs effectively.
The neurosymbolic scaffold raises performance for both small and frontier-scale models.
Proof optimization becomes a scalable, learnable task rather than a purely heuristic one.
Optimized proofs improve both library maintainability and the quality of training data for neural provers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scaffold-and-iteration pattern could be tested on proof libraries in other formal systems.
Iteratively improved proof data might accelerate training of future neural provers beyond the metrics reported here.
The structural metrics could serve as an automated filter or ranking signal inside existing proof libraries.

Load-bearing premise

The suite of structural proof metrics accurately reflects the goals of maintainability and training-data quality without introducing unmeasured biases or omissions.

What would settle it

Human experts rating the actual maintainability and readability of proofs produced by the 7B model versus the original proofs and versus outputs from larger models.

Figures

Figures reproduced from arXiv: 2605.22885 by Jeremy Avigad, Riyaz Ahuja, Sean Welleck, Tate Rowney.

**Figure 2.** Figure 2: Example Lean proof (top) and its generated proof tree (bottom). Note the root node (in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: ImProver2 training loop. The diagram illustrates the iterative process of generation, retrieval, filtering, and training. Node colors represent the evolution of data from initial sampling (Blue) through processing (Purple) to training (Magenta). comes from three sources: a context slice to find relevant lemmas or definitions, goal-state traces to highlight the exact effect of each tactic on the progress of… view at source ↗

**Figure 4.** Figure 4: Effect of parameter count on model performance on mean improvement at best@16 across [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of ImProver 2 against frontier GPT-based models and ImProver, evaluated on [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Performance of ImProver 2 as a function of sample budget across training iterations. Gains [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Training statistics during iteration 3 of the dependency metric. Model shows stable [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Effect of neurosymbolic augmentation on base model performance [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

**Figure 9.** Figure 9: Hyperparameter grid searches for the length metric across iterations 1–3. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: Hyperparameter grid searches for the dependency metric across iterations 1–3. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: ImProver 2 optimizes a proof for dependencies, reducing explicit dependencies by 3. [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

**Figure 12.** Figure 12: ImProver 2 optimizes a proof for dependencies, reducing explicit dependencies by 2. [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗

**Figure 13.** Figure 13: ImProver 2 optimizes an AlphaProof proof for dependencies, reducing explicit dependen [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: ImProver 2 optimizes a proof for length, reducing tactic count by 19. [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: ImProver 2 optimizes a proof for length, reducing tactic count by 8. [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗

**Figure 16.** Figure 16: ImProver 2 optimizes an AlphaProof proof for length, reducing tactic count by 26. [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗

**Figure 17.** Figure 17: ImProver 2 optimizes a proof for modularity, increasing modularity score (see A) by 2. [PITH_FULL_IMAGE:figures/full_fig_p031_17.png] view at source ↗

**Figure 18.** Figure 18: ImProver 2 optimizes a proof for modularity, increasing modularity score (see A) by 2. [PITH_FULL_IMAGE:figures/full_fig_p031_18.png] view at source ↗

**Figure 19.** Figure 19: ImProver 2 optimizes an AlphaProof proof for modularity, increasing modularity score by [PITH_FULL_IMAGE:figures/full_fig_p033_19.png] view at source ↗

read the original abstract

Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve training data quality for neural provers. However, scalable proof optimization is hindered by heterogeneous and heuristically specified objectives, scarce data, and high training and inference costs. To overcome these challenges, we introduce ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. ImProver 2 combines a data-efficient expert-iteration pipeline with a scaffold that exposes formal structure alongside lightweight informal abstractions. We further introduce a suite of metrics capturing structural proof properties. Using ImProver 2, we train a 7B-parameter model that outperforms orders-of-magnitude larger models within the same model family, and is competitive with mid-tier frontier models across metrics. We additionally demonstrate that our neurosymbolic scaffold significantly improves performance across both small and frontier models. We show that with proper scaffolding and training, small models can effectively restructure research-level proofs over complex and varied metrics, matching substantially larger systems and establishing proof optimization as a scalable, learnable task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ImProver 2 gives a concrete neurosymbolic pipeline that lets a 7B model beat much larger models on Lean proof refactoring, but the full paper is needed to check the metric suite and training details.

read the letter

The main takeaway is that this work shows a 7B model, trained via expert iteration on a mixed formal-informal scaffold, can restructure research-level Lean proofs and beat larger models in the same family on a new set of structural metrics. That is the concrete result worth noting first. The paper does a few things cleanly: it frames proof optimization as a learnable task rather than a pure search problem, adds lightweight informal abstractions to the formal Lean structure, and reports that the scaffold helps both small and frontier models. The empirical claim that the 7B model is competitive with mid-tier frontier systems across those metrics is the part that would matter to people building training data or refactoring large libraries. The new metric suite is presented as capturing maintainability and data quality, which is a reasonable direction even if the exact definitions need scrutiny. The main soft spot is that the abstract alone does not let us verify whether the reported gains hold after checking for post-hoc choices, whether the metrics introduce their own biases, or how stable the improvements are across different libraries. The circularity risk looks low from what is shown, and the work does not appear to rest on self-referential loops. This paper is aimed at researchers in neural theorem proving and formal mathematics tooling who already care about scaling proof maintenance. A reader working on Lean data pipelines or self-improvement loops would get direct value from the scaffold description and the 7B results. It deserves a serious referee because the problem is real, the setup is reproducible in principle, and the central empirical direction is falsifiable even if the current numbers need more context.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. It combines a data-efficient expert-iteration pipeline with a scaffold exposing formal structure alongside lightweight informal abstractions, introduces a suite of metrics for structural proof properties, and reports that a 7B-parameter model trained via this method outperforms orders-of-magnitude larger models in the same family while remaining competitive with mid-tier frontier models; the scaffold is shown to improve performance across both small and frontier models.

Significance. If the empirical claims hold under rigorous verification, the work would be significant for formal mathematics and neural theorem proving: it positions proof optimization (refactoring for maintainability and training-data improvement) as a scalable, learnable task rather than a purely heuristic one, and demonstrates that modest-sized models can restructure research-level proofs across heterogeneous objectives when given appropriate neurosymbolic scaffolding.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for acknowledging the potential significance of positioning proof optimization as a scalable, learnable task with neurosymbolic scaffolding. We note the 'uncertain' recommendation and are happy to provide any additional verification, code, or experimental details that would help resolve concerns about the empirical claims.

Circularity Check

0 steps flagged

No significant circularity in empirical training pipeline

full rationale

The abstract and description present ImProver 2 as an empirical neurosymbolic training framework that produces performance gains on introduced metrics. No equations, derivations, or self-citations are shown that would make any reported result equivalent to its inputs by construction. The central claims (model training and metric improvements) are outcomes of an external training process rather than reductions to fitted parameters or self-referential definitions. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete equations, fitting procedures, or postulated entities; therefore the ledger is empty.

pith-pipeline@v0.9.0 · 5740 in / 1203 out tokens · 42384 ms · 2026-05-25T06:08:39.216151+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 3 internal anchors

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/2501.12948. Simon Frieder, Jonas Bayer, Sam Looi, Jacob Loader, Julius Berner, Katherine M. Collins, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Cameron Freer, Thomas Lukasiewicz, and Timothy Gowers. Data for mathematical copilots: Better ways of presenting proofs ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

URL https://aclanthology.org/2025

Association for Computational Linguistics. URL https://aclanthology.org/2025. inlg-main.23/. Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z. Horváth, Goran Žuži´c, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Zahavy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah...

work page doi:10.1038/s41586-025-09833-y 2025
[3]

Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si

URLhttps://doi.org/10.48550/arXiv.2210.12283. Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. A survey on deep learning for theorem proving. InFirst Conference on Language Modeling,

work page doi:10.48550/arxiv.2210.12283
[4]

URLhttps://openreview.net/forum?id=zlw6AHwukB. 14 Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover-V2: Scaling formal theorem proving with scaffolde...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.acl-long.817 2025
[5]

Generative Language Modeling for Automated Theorem Proving

URLhttps://arxiv.org/abs/2009.03393. Shogo Saito and Mashu Noguchi. Foundation, 2026. URL https://github.com/ FormalizedFormalLogic/Foundation. Ivan Sergeev, Martin Dvorak, Tristan Figueroa-Reid, Rida Hamadani, Byung-Hak Hwang, Evgenia Karunus, Vladimir Kolmogorov, Alex Meiburg, Peter Nelson, and Mark Sandey. Regularity of 1-, 2-, and 3-sums of matroids, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-024-07566-y 2009
[6]

Focuses a goal?m i ∈ A i,

work page
[7]

Produces an assignmentσ i+1(?mi) =t i,

work page
[8]

The set of newly created metavariables is exactly the set of free metavariables occurring inti after assignment

Potentially introduces new metavariables corresponding to subgoals. The set of newly created metavariables is exactly the set of free metavariables occurring inti after assignment. A.2 Direct Children vs. Spawned Goals Let Children(τi) denote the set of metavariables that occur free in ti and therefore representdirect subgoalsof the focused goal?m i. Howe...

work page
[9]

Its sequent variant hash matches any previously seen goal (global duplicate), or

work page
[10]

This eliminates the potential for many types of reward hacking cases, such as trivial ∀-introductions, restatements of the parent goal, and redundant lemma spawning

Its target is a definitional wrapper of its parent goal, i.e.target(gchild)∈targetVariants(g parent) or vice versa. This eliminates the potential for many types of reward hacking cases, such as trivial ∀-introductions, restatements of the parent goal, and redundant lemma spawning. A.6 Nontriviality Filter LetT g be the subtree of steps rooted at a spawned...

work page
[11]

the “easiest”) until the replay fraction isρ(cap atπ max)

Mark:We first mark problems in D(t) nr asREPLAYorFRONTIERso that the fraction of replay equalsρ: (a) Start with the subset of problems that are replay-eligible; if their fraction exceeds ρ, downsample them by removing highest-πT items (i.e. the “easiest”) until the replay fraction isρ(cap atπ max). (b) If their fraction is below ρ, replace uniformly chose...

work page 2024
[12]

The optimized proof removes a discarded intermediate lemma and folds repeated divisibility and 30 Euler-theorem reasoning into the main argument. The resulting proof is still dense, but it is materially shorter while preserving the same number-theoretic spine: prove the singleton characterization, force divisibility byab+ 1, and derivea=b= 1. F.2.3 Modula...

work page

[1] [1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/2501.12948. Simon Frieder, Jonas Bayer, Sam Looi, Jacob Loader, Julius Berner, Katherine M. Collins, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Cameron Freer, Thomas Lukasiewicz, and Timothy Gowers. Data for mathematical copilots: Better ways of presenting proofs ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

URL https://aclanthology.org/2025

Association for Computational Linguistics. URL https://aclanthology.org/2025. inlg-main.23/. Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z. Horváth, Goran Žuži´c, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Zahavy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah...

work page doi:10.1038/s41586-025-09833-y 2025

[3] [3]

Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si

URLhttps://doi.org/10.48550/arXiv.2210.12283. Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. A survey on deep learning for theorem proving. InFirst Conference on Language Modeling,

work page doi:10.48550/arxiv.2210.12283

[4] [4]

URLhttps://openreview.net/forum?id=zlw6AHwukB. 14 Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover-V2: Scaling formal theorem proving with scaffolde...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.acl-long.817 2025

[5] [5]

Generative Language Modeling for Automated Theorem Proving

URLhttps://arxiv.org/abs/2009.03393. Shogo Saito and Mashu Noguchi. Foundation, 2026. URL https://github.com/ FormalizedFormalLogic/Foundation. Ivan Sergeev, Martin Dvorak, Tristan Figueroa-Reid, Rida Hamadani, Byung-Hak Hwang, Evgenia Karunus, Vladimir Kolmogorov, Alex Meiburg, Peter Nelson, and Mark Sandey. Regularity of 1-, 2-, and 3-sums of matroids, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-024-07566-y 2009

[6] [6]

Focuses a goal?m i ∈ A i,

work page

[7] [7]

Produces an assignmentσ i+1(?mi) =t i,

work page

[8] [8]

The set of newly created metavariables is exactly the set of free metavariables occurring inti after assignment

Potentially introduces new metavariables corresponding to subgoals. The set of newly created metavariables is exactly the set of free metavariables occurring inti after assignment. A.2 Direct Children vs. Spawned Goals Let Children(τi) denote the set of metavariables that occur free in ti and therefore representdirect subgoalsof the focused goal?m i. Howe...

work page

[9] [9]

Its sequent variant hash matches any previously seen goal (global duplicate), or

work page

[10] [10]

This eliminates the potential for many types of reward hacking cases, such as trivial ∀-introductions, restatements of the parent goal, and redundant lemma spawning

Its target is a definitional wrapper of its parent goal, i.e.target(gchild)∈targetVariants(g parent) or vice versa. This eliminates the potential for many types of reward hacking cases, such as trivial ∀-introductions, restatements of the parent goal, and redundant lemma spawning. A.6 Nontriviality Filter LetT g be the subtree of steps rooted at a spawned...

work page

[11] [11]

the “easiest”) until the replay fraction isρ(cap atπ max)

Mark:We first mark problems in D(t) nr asREPLAYorFRONTIERso that the fraction of replay equalsρ: (a) Start with the subset of problems that are replay-eligible; if their fraction exceeds ρ, downsample them by removing highest-πT items (i.e. the “easiest”) until the replay fraction isρ(cap atπ max). (b) If their fraction is below ρ, replace uniformly chose...

work page 2024

[12] [12]

The optimized proof removes a discarded intermediate lemma and folds repeated divisibility and 30 Euler-theorem reasoning into the main argument. The resulting proof is still dense, but it is materially shorter while preserving the same number-theoretic spine: prove the singleton characterization, force divisibility byab+ 1, and derivea=b= 1. F.2.3 Modula...

work page