Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao · 2025 · cs.LG · arXiv 2508.03613

Canonical reference. 100% of citing Pith papers cite this work as background.

26 Pith papers citing it

Background 100% of classified citations

abstract

We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.
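The verifier-guided self-correction described in the abstract can be pictured as a generate, check, revise loop. The sketch below is only an illustration of that idea, not the authors' implementation: `generate`, `verify`, `max_rounds`, and the toy stand-ins are all hypothetical names, and the toy verifier compares strings rather than calling the Lean compiler.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   generate(statement, previous_attempt, feedback) -> candidate proof text
#   verify(statement, proof) -> (ok, compiler_message)
Generate = Callable[[str, Optional[str], Optional[str]], str]
Verify = Callable[[str, str], Tuple[bool, str]]


def prove_with_self_correction(
    statement: str,
    generate: Generate,
    verify: Verify,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (verified proof or None, number of verifier calls used)."""
    proof: Optional[str] = None
    feedback: Optional[str] = None
    for round_index in range(1, max_rounds + 1):
        # The first round has no feedback; later rounds see the previous
        # attempt together with the verifier's message about it.
        proof = generate(statement, proof, feedback)
        ok, feedback = verify(statement, proof)
        if ok:
            return proof, round_index
    return None, max_rounds


# Toy stand-ins so the loop can be run without a model or a Lean toolchain.
def toy_generate(statement: str, previous: Optional[str], feedback: Optional[str]) -> str:
    # Pretend the model revises its proof once it has seen an error message.
    return "by simp" if feedback else "by rfl"


def toy_verify(statement: str, proof: str) -> Tuple[bool, str]:
    # Pretend only the revised proof is accepted.
    if proof == "by simp":
        return True, ""
    return False, "error: first attempt rejected"


result = prove_with_self_correction("example : 1 + 1 = 2", toy_generate, toy_verify)
print(result)
```

In this toy run the first attempt is rejected and the second is accepted, so the loop stops after two verifier calls. In a real setup `verify` would wrap a Lean compilation step and `generate` would prompt the prover with the failed proof and the compiler output.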

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

MathAtlas: A Benchmark for Autoformalization in the Wild

cs.AI · 2026-05-13 · accept · novelty 8.0

MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

Advancing Mathematics Research with AI-Driven Formal Proof Search

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.

Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

math.ST · 2026-05-18 · unverdicted · novelty 7.0

s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

cs.AI · 2026-05-17 · accept · novelty 7.0

CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Automatic Textbook Formalization

cs.AI · 2026-04-03 · accept · novelty 7.0

Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.

Ablation and the Meno: Tools for Empirical Metamathematics

cs.LO · 2026-04-24 · unverdicted · novelty 6.0

Meno and tactic ablation on Tao's Analysis I generate proof populations that embed on low one- or two-dimensional submanifolds far from human constructions in Goedel Prover space.

Scaling Self-Play with Self-Guidance

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

The topological dual of a dataset is introduced as a transformation that encodes logical structures into topological ones to expose invariants in neural latent spaces for AlphaGeometry-style reasoning.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

A Minimal Agent for Automated Theorem Proving

cs.AI · 2026-02-27 · unverdicted · novelty 6.0

A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

R³L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

cs.LG · 2026-01-07 · unverdicted · novelty 6.0

R³L combines reflect-then-retry exploration, pivotal credit assignment, and positive amplification in RL for LLMs, reporting 5-52% relative gains on agentic and reasoning tasks with stable training.

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

cs.AI · 2025-10-14 · unverdicted · novelty 6.0

Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory benchmarks while also assisting an expert with a cryptography proof.

Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

cs.LG · 2025-09-16 · unverdicted · novelty 6.0

LLMs in a conjecturing-proving loop that conditions on their own prior verified Lean proofs discover more hard-to-prove theorems than baselines that generate statements and proofs together.

Pseudo-Formalization for Automatic Proof Verification

cs.LO · 2026-05-19 · unverdicted · novelty 5.0

Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

cs.LG · 2026-04-26 · unverdicted · novelty 5.0

OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.

On Reasoning-Centric LLM-based Automated Theorem Proving

cs.SE · 2026-04-21 · unverdicted · novelty 5.0

ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.

citing papers explorer

Showing 26 of 26 citing papers.

MathAtlas: A Benchmark for Autoformalization in the Wild

cs.AI · 2026-05-13 · accept · none · ref 18 · internal anchor

MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

Advancing Mathematics Research with AI-Driven Formal Proof Search

cs.AI · 2026-05-21 · unverdicted · none · ref 40 · internal anchor

LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.

Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

math.ST · 2026-05-18 · unverdicted · none · ref 4 · internal anchor

s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

cs.AI · 2026-05-17 · accept · none · ref 20 · internal anchor

CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

cs.CL · 2026-05-11 · unverdicted · none · ref 33 · internal anchor

LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · none · ref 27 · 2 links · internal anchor

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Automatic Textbook Formalization

cs.AI · 2026-04-03 · accept · none · ref 12 · internal anchor

Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

cs.AI · 2026-05-21 · unverdicted · none · ref 4 · internal anchor

ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · none · ref 172 · internal anchor

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

cs.AI · 2026-05-12 · unverdicted · none · ref 5 · internal anchor

Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

cs.AI · 2026-05-07 · unverdicted · none · ref 5 · 2 links · internal anchor

MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

cs.LG · 2026-05-01 · unverdicted · none · ref 12 · internal anchor

The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.

Ablation and the Meno: Tools for Empirical Metamathematics

cs.LO · 2026-04-24 · unverdicted · none · ref 11 · internal anchor

Meno and tactic ablation on Tao's Analysis I generate proof populations that embed on low one- or two-dimensional submanifolds far from human constructions in Goedel Prover space.

Scaling Self-Play with Self-Guidance

cs.LG · 2026-04-22 · unverdicted · none · ref 20 · internal anchor

SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

cs.AI · 2026-04-20 · unverdicted · none · ref 10 · internal anchor

The topological dual of a dataset is introduced as a transformation that encodes logical structures into topological ones to expose invariants in neural latent spaces for AlphaGeometry-style reasoning.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

cs.LG · 2026-04-06 · unverdicted · none · ref 15 · internal anchor

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

A Minimal Agent for Automated Theorem Proving

cs.AI · 2026-02-27 · unverdicted · none · ref 22 · internal anchor

A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

R³L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

cs.LG · 2026-01-07 · unverdicted · none · ref 2 · internal anchor

R³L combines reflect-then-retry exploration, pivotal credit assignment, and positive amplification in RL for LLMs, reporting 5-52% relative gains on agentic and reasoning tasks with stable training.

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

cs.AI · 2025-10-14 · unverdicted · none · ref 39 · internal anchor

Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory benchmarks while also assisting an expert with a cryptography proof.

Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

cs.LG · 2025-09-16 · unverdicted · none · ref 8 · internal anchor

LLMs in a conjecturing-proving loop that conditions on their own prior verified Lean proofs discover more hard-to-prove theorems than baselines that generate statements and proofs together.

Pseudo-Formalization for Automatic Proof Verification

cs.LO · 2026-05-19 · unverdicted · none · ref 18 · internal anchor

Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · none · ref 89 · internal anchor

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

cs.LG · 2026-04-26 · unverdicted · none · ref 15 · internal anchor

OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.

On Reasoning-Centric LLM-based Automated Theorem Proving

cs.SE · 2026-04-21 · unverdicted · none · ref 13 · internal anchor

ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.

Agentic Proving for Program Verification

cs.AI · 2026-05-22 · unverdicted · none · ref 19 · internal anchor

Agentic Claude reaches 98.8% valid specs, 87.5% implementation certification, and 98.1% end-to-end success on CLEVER, revealing a mismatch between benchmark difficulty and current prover performance.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unreviewed · ref 53 · internal anchor
