pith. machine review for the scientific record.

arxiv: 2605.13171 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: no theorem link

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 20:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords Lean 4 · formal mathematics · mathematical conjectures · automated theorem proving · benchmarks · proof discovery · autoformalization · AI for mathematics

The pith

Formal Conjectures supplies 2615 Lean 4 statements of active research problems, 1029 of them open conjectures, as a contamination-free testbed for automated proof discovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Formal Conjectures, an open, evolving collection of mathematical statements written in Lean 4. It contains both solved problems suited for autoformalization and 1029 unresolved research conjectures that serve as a clean benchmark for systems attempting to find proofs. The repository is maintained through community contributions and uses AI-generated proofs as an auditing tool to catch errors in the formalizations. The authors report that the benchmark has already produced new mathematical results, including the resolution of previously open conjectures. They also supply a fixed evaluation protocol and baseline scores that track progress on research-level tasks.

Core claim

Formal Conjectures is a collaborative Lean 4 repository of 2615 statements drawn from current mathematical research. It separates 1029 open conjectures, which function as a zero-contamination benchmark for proof discovery, from 836 solved problems intended for autoformalization. The project maintains a structured interface between mathematicians who contribute and clarify statements and the humans or AI systems that attempt solutions. AI-generated proofs and disproofs are used to audit and refine the formalizations. Baseline results on frozen subsets show measurable improvement in automated reasoning performance on these problems.
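Based on the entry format the paper describes (an informal statement as a docstring comment, category tags, and a Lean statement), a single repository item can be sketched as follows. The theorem name and the choice of conjecture are illustrative, not verbatim repository content; only the `@[category research open, AMS …]` tag style appears in the paper's own excerpts.

```lean
-- Illustrative sketch of a Formal Conjectures entry (hypothetical name and
-- statement; the real repository pairs each Lean statement with its informal
-- source, tags, and a `sorry` placeholder awaiting a proof or disproof).

/-- Every even natural number greater than 2 is the sum of two primes. -/
@[category research open, AMS 11]
theorem goldbach_conjecture (n : ℕ) (h2 : 2 < n) (he : Even n) :
    ∃ p q : ℕ, p.Prime ∧ q.Prime ∧ p + q = n := by
  sorry
```

A proof of such a statement accepted by the Lean kernel would constitute a machine-checked resolution; a disproof of a mis-stated version is exactly the auditing signal the paper describes.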

What carries the argument

The Formal Conjectures repository, a structured Lean 4 dataset of research-level statements together with a community contribution and AI-auditing interface.

If this is right

  • AI systems can be tested on genuine open research problems without risk of training-data contamination.
  • New conjectures can be resolved by submitting them to the benchmark and using automated search.
  • Community contributions allow the set of available problems to grow and improve over time.
  • AI-generated proofs serve as an ongoing check on the accuracy of the formalizations.
  • Standardized frozen evaluation subsets give a repeatable signal of progress in automated reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark is adopted by the wider community, it could become a shared public ledger for tracking which open problems have been attempted and by which methods.
  • The same interface pattern might be applied to other formal systems or to domains outside pure mathematics where verified discovery is valued.
  • Success on the open-conjecture portion would indicate that current automated systems can move from textbook exercises to frontier questions.
  • Regular updates to the benchmark would create a moving target that forces continuous improvement in proof-search techniques.

Load-bearing premise

The Lean 4 formalizations correctly capture the original mathematical statements without introducing errors, simplifications, or changes that alter their difficulty or meaning.

What would settle it

A documented mismatch between any formal statement in the benchmark and the published conjecture it claims to represent, such as an omitted hypothesis or altered quantifier scope that changes the logical content.
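The failure mode named here can be made concrete with a minimal Lean 4 sketch (hypothetical identifiers, for illustration only): the two statements below differ only in a dropped hypothesis, yet the second no longer says what the informal source says.

```lean
-- Faithful formalization of the informal claim "every prime p > 2 is odd":
theorem odd_of_two_lt_prime (p : ℕ) (hp : p.Prime) (h2 : 2 < p) : Odd p := by
  sorry

-- Mistranscription that omits the hypothesis `2 < p`. The statement is now
-- false (p = 2 is a counterexample), so an automated disproof of it would
-- document exactly the kind of mismatch that would settle the question.
theorem odd_of_prime (p : ℕ) (hp : p.Prime) : Odd p := by
  sorry
```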

Figures

Figures reproduced from arXiv: 2605.13171 by Blaise Agüera y Arcas, Calle Sönne, Eric Wieser, Fred Zhang, Miklós Z. Horváth, Moritz Firsching, Paul Lezeau, Pushmeet Kohli, Salvatore Mercuri, Thomas Hubert, Yaël Dillies.

Figure 1. Left: informal source distribution. Right: formal category distribution, including open (1029) and solved (836) problems.

Figure 2. Repository growth over time. The benchmark's problems are sourced from diverse mathematical literature for broad coverage; key sources include the collection of problems posed by Paul Erdős, as catalogued on erdosproblems.com [Bloom, 2023], the Kourovka Notebook of unsolved problems in group theory [Khukhro and Mazurov, 2026], and the IQOQI Vienna list of open quantum problems [IQOQI Vienna and Werner, 2017].

Figure 3. A statement in Formal Conjectures, an open conjecture from [Hartshorne, 1974]. It includes an informal statement as a comment, tags, and a Lean statement. The collection is divided into two primary categories: Unsolved Conjectures, open research problems where a formal or informal proof would represent a new mathematical discovery; and Solved Problems, established theorems with known informal proofs.

Figure 4. Iterative pipeline for Formal Conjectures. Formalized statements are tested by periodic AI runs, solving problems or triggering a revision loop.

Figure 5. Code diff showing an example of a syntactical error, and the correction of parentheses.

Figure 6. Code diff showing an example of a semantic error, with the fix requiring quantification over the support of the polynomial:

    /-- `Polynomial.HasOddCoeffs f` means that all coefficients of `f : Polynomial ℤ` are odd. -/
    def Polynomial.HasOddCoeffs (f : Polynomial ℤ) : Prop :=
    - ∀ i : ℕ, Odd (f.coeff i)
    + ∀ i ∈ f.support, Odd (f.coeff i)

Figure 7. Code diff showing an example of a semantic error, where the formalization had failed to account for behavior at the natural number 0.

Figure 8. Code diff showing an example of a semantic error, where the formalization missed the subtlety that Partition invokes sup for the covering condition, which means something different for subgroups than for sets.

Figure 9. An example of a misrepresentation error, where a hypothesis from the informal text was omitted.

Figure 10. Code diff showing an example of a misrepresentation error, where the vertex type V was incorrectly quantified: in the old version V was a section variable and thus appeared outside the scope of the existential quantifier.

Figure 11. Code diff showing an example of an implicit convention, where the source (https://www.erdosproblems.com/510) implicitly assumes that the inequality holds for sufficiently large N. This is a common implicit convention in analytic number theory, where one is typically interested in the asymptotic behavior of functions or sequences of natural numbers.

Figure 12. Code diff showing an example of an implicit convention, where the source wrote "subgraph" rather than "induced subgraph". The comments demonstrate that the problem is trivial if one does not restrict to induced subgraphs, which is why it is classed as an implicit convention; it might also be considered a reporting error, since the original paper explicitly writes "induced subgraph" [Erdős, 1993].

Figure 13. A reporting error caused by ambiguities in the historical source material regarding induced subgraphs.
Original abstract

As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Formal Conjectures, an evolving benchmark of 2615 Lean 4 formalized mathematical statements drawn from active research areas. It includes 1029 open research conjectures intended as a zero-contamination testbed for proof discovery and 836 solved problems for autoformalization, together with a collaborative open-source interface linking mathematicians and AI systems. The manuscript describes an auditing workflow that uses AI-generated proofs and disproofs to improve formalization fidelity, provides a standardized evaluation protocol, and reports baseline results on frozen subsets while claiming that the benchmark has already enabled new mathematical discoveries, including resolutions of open conjectures.

Significance. If the formalizations faithfully preserve the original statements and the reported discoveries are independently verifiable, the benchmark would supply a much-needed, community-maintained resource for measuring progress on research-level automated reasoning. Its open, evolving design and explicit separation of open versus solved items could support reproducible evaluation and incremental community contributions, addressing a clear gap between current synthetic benchmarks and frontier mathematical problems.

major comments (2)
  1. The abstract asserts that the benchmark 'has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures,' yet supplies no concrete examples, proof artifacts, or independent verification details. Without side-by-side source-to-formalization comparisons or mathematician sign-off on any resolved item, the central utility claim cannot be assessed.
  2. The manuscript states that AI-generated proofs serve as an auditing mechanism to ensure fidelity of the 1029 open conjectures, but provides no quantification of how often such audits caught versus missed semantic drift, nor any reported error rates or expert cross-check statistics. This leaves the zero-contamination guarantee unverified.
minor comments (2)
  1. The total of 2615 statements is stated, yet the breakdown (1029 open + 836 solved) leaves 750 items unaccounted for; a brief table or sentence clarifying the remaining categories would improve clarity.
  2. Baseline results are mentioned but no numerical performance figures, model names, or evaluation metrics appear in the provided text; including at least summary statistics for the frozen subsets would strengthen the evaluation section.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on the manuscript. We address the major points below and describe the changes we will make in revision.

read point-by-point responses
  1. Referee: The abstract asserts that the benchmark 'has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures,' yet supplies no concrete examples, proof artifacts, or independent verification details. Without side-by-side source-to-formalization comparisons or mathematician sign-off on any resolved item, the central utility claim cannot be assessed.

    Authors: We agree that the manuscript would be strengthened by concrete supporting details for this claim. In the revised version we will add a dedicated subsection with at least two specific examples of resolved conjectures. Each example will include the original informal statement, the corresponding Lean formalization, the discovered proof artifact (with repository link), and any available mathematician sign-off or verification note. Side-by-side comparisons will be provided for the first example to illustrate fidelity. revision: yes

  2. Referee: The manuscript states that AI-generated proofs serve as an auditing mechanism to ensure fidelity of the 1029 open conjectures, but provides no quantification of how often such audits caught versus missed semantic drift, nor any reported error rates or expert cross-check statistics. This leaves the zero-contamination guarantee unverified.

    Authors: We recognize that the current text lacks quantitative evaluation of the auditing process. In revision we will expand the auditing section to describe the workflow with concrete qualitative examples of semantic issues caught by AI-generated proofs or disproofs, and we will explicitly note the absence of systematic catch/miss statistics or expert cross-check rates. We will also clarify that the zero-contamination property relies on the open-source, community-audited nature of the repository rather than on quantified audit performance. revision: partial

standing simulated objections not resolved
  • Quantitative statistics on how often AI audits caught versus missed semantic drift, including error rates and expert cross-check results, are not available because such systematic tracking has not been performed.

Circularity Check

0 steps flagged

No circularity: benchmark sourced externally with independent verification claims

full rationale

The paper constructs Formal Conjectures from externally sourced open conjectures and solved problems contributed by the mathematical community, with no self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to tautology. The utility demonstration (new discoveries via the benchmark) is presented as an empirical outcome of external use rather than an internal derivation; formalization fidelity is asserted through an open collaborative process and AI auditing loop, but this does not equate any result to its inputs by construction. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a self-referential manner. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the provided Lean 4 formalizations faithfully represent the original mathematical problems and that these problems are suitable for evaluating AI reasoning capabilities.

axioms (1)
  • domain assumption: Lean 4 formalizations correctly and completely capture the mathematical meaning of the included conjectures and problems.
    The benchmark's value for AI evaluation depends entirely on the accuracy of these formal statements.

pith-pipeline@v0.9.0 · 5555 in / 1255 out tokens · 48097 ms · 2026-05-14T20:17:32.804574+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 16 canonical work pages · 4 internal anchors

  1. Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, et al. Aristotle: IMO-level Automated Theorem Proving. arXiv:2510.01346.
  2. Alexeev and D. https://arxiv.org/abs/2510.19804.
  3. Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin T. Vechev. MathArena: Evaluating LLMs on Uncontaminated Math Competitions. arXiv:2505.23281.
  4. Thomas Bloom. Erdős Problems. https://www.erdosproblems.com/. Accessed 2026-03-24.
  5. Thomas Bloom. Erdős Problem #124. https://www.erdosproblems.com/124. See comment section. Accessed 2026-04-20.
  6. Center for AI Safety, Scale AI, and HLE Contributors Consortium. Humanity's Last Exam: a benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146. doi:10.1038/s41586-025-09962-4. arXiv:2501.14249.
  7. Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, et al. Seed-Prover 1.5: Mastering undergraduate-level theorem proving via learning from experience. arXiv:2512.17260, 2025.
  8. François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. ARC Prize 2024: Technical Report. arXiv:2412.04604.
  9. François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. ARC Prize 2025: Technical Report. arXiv:2601.10904.
  10. Google DeepMind. DeepMind prover agent. In preparation.
  11. doi:10.1080/16073606.1993.9631741.
  12. Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, et al. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv:2411.04872, 2024.
  13. Robin Hartshorne. Varieties of small codimension in projective space. Bulletin of the American Mathematical Society, 80(6):1017–1032, November 1974.
  14. Olympiad-level formal mathematical reasoning with reinforcement learning. doi:10.1038/s41586-025-09833-y.
  15. IQOQI Vienna and Reinhard Werner. Open Quantum Problems, 2017. Accessed 2026-03-24.
  16. Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailin Guan, Peihao Wu, Chunbo Dai, Liang Xiao, et al. FATE: A formal benchmark series for frontier algebra of multiple difficulty levels. arXiv:2511.02872.
  17. Khukhro and Mazurov. Unsolved Problems in Group Theory. The Kourovka Notebook, 21st edition. arXiv:1401.0300v41. First published in 1965; continuously updated on arXiv.
  18. Sarra, A. https://arxiv.org/abs/2603.02668.
  19. Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-Prover-V2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. arXiv:2508.03613.
  20. Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. Numina-Lean-Agent: An open and general agentic reasoning system for formal mathematics. arXiv:2601.14027.
  21. The mathlib Community. The Lean Mathematical Library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP 2020), New Orleans, LA, USA. doi:10.1145/3372885.3373824.
  22. George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition. arXiv:2407.11214, 2024.
  23. Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, et al. FormalMath: Benchmarking formal mathematical reasoning of large language models. arXiv:2505.02735.
  24. Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics. arXiv:2109.00110, 2022.

Internal anchors (fragments of the paper itself):
  • Figure 12 note: the source wrote "subgraph" rather than "induced subgraph"; the comments demonstrate that the problem is trivial without the restriction to induced subgraphs, which is why it is classed as an implicit convention.
  • In the former citation the explicit use of ≤ ℵ0 appears, while it does not appear in the latter citation (Erdős [1969]). The ambiguity was only clarified after draft formalization with the discovery of trivial solutions: in the (non-induced) subgraph case by the website comments, and in the induced subgraph case by AlphaProof.
  • Experimental evaluation details: the frozen subsets FC100SolvedSet1 and FC100OpenSet1 consist of 100 problems each, sampled uniformly at random from statements in the research solved and research open categories; they are defined by the files FormalConjectures/Subsets/FC100OpenSet1.lean and Formal…