Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Pith reviewed 2026-05-14 20:17 UTC · model grok-4.3
The pith
Formal Conjectures supplies 2615 Lean 4 statements of active research problems, 1029 of them open conjectures, as a contamination-free testbed for automated proof discovery.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formal Conjectures is a collaborative Lean 4 repository of 2615 statements drawn from current mathematical research. It separates 1029 open conjectures, which function as a zero-contamination benchmark for proof discovery, from 836 solved problems intended for autoformalization. The project maintains a structured interface between mathematicians who contribute and clarify statements and the humans or AI systems that attempt solutions. AI-generated proofs and disproofs are used to audit and refine the formalizations. Baseline results on frozen subsets show measurable improvement in automated reasoning performance on these problems.
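To make the repository's format concrete, a minimal hypothetical entry might look like the following. The statement, theorem name, and layout are invented for illustration and are not an actual benchmark entry; real entries follow the project's own conventions and attribution headers.

```lean
import Mathlib

-- Hypothetical sketch of an open-conjecture entry: the statement is fully
-- formal, but the proof is left as `sorry`, so Lean checks only that the
-- statement is well-formed. Provers then attempt to replace the `sorry`.
theorem goldbach_conjecture :
    ∀ n : ℕ, 2 < n → Even n → ∃ p q : ℕ, p.Prime ∧ q.Prime ∧ n = p + q := by
  sorry
```

The `sorry` placeholder is what separates the open-conjecture split from the solved split: solved problems additionally carry a complete, machine-checked proof term.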
What carries the argument
The Formal Conjectures repository, a structured Lean 4 dataset of research-level statements together with a community contribution and AI-auditing interface.
If this is right
- AI systems can be tested on genuine open research problems without risk of training-data contamination.
- New conjectures can be resolved by submitting them to the benchmark and using automated search.
- Community contributions allow the set of available problems to grow and improve over time.
- AI-generated proofs serve as an ongoing check on the accuracy of the formalizations.
- Standardized frozen evaluation subsets give a repeatable signal of progress in automated reasoning.
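The frozen-subset evaluation described above reduces to a simple pass-rate computation over a fixed list of statements. A minimal sketch (the function names and stub prover are hypothetical, not the paper's actual harness):

```python
def pass_rate(subset, attempt_proof):
    """Fraction of statements in a frozen subset the prover closes.

    `subset` is a list of formal statement identifiers; `attempt_proof`
    returns True when the prover produces a machine-checked proof.
    """
    if not subset:
        raise ValueError("empty evaluation subset")
    solved = sum(1 for stmt in subset if attempt_proof(stmt))
    return solved / len(subset)


# Toy usage: a stub prover that only "solves" two of four statements.
frozen = ["conj_a", "conj_b", "conj_c", "conj_d"]
stub = lambda s: s in {"conj_a", "conj_c"}
print(pass_rate(frozen, stub))  # → 0.5
```

Freezing the subset is what makes the number comparable across systems and over time: the denominator never moves even as the evolving repository grows.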
Where Pith is reading between the lines
- If the benchmark is adopted by the wider community, it could become a shared public ledger for tracking which open problems have been attempted and by which methods.
- The same interface pattern might be applied to other formal systems or to domains outside pure mathematics where verified discovery is valued.
- Success on the open-conjecture portion would indicate that current automated systems can move from textbook exercises to frontier questions.
- Regular updates to the benchmark would create a moving target that forces continuous improvement in proof-search techniques.
Load-bearing premise
The Lean 4 formalizations correctly capture the original mathematical statements without introducing errors, simplifications, or changes that alter their difficulty or meaning.
What would settle it
A documented mismatch between any formal statement in the benchmark and the published conjecture it claims to represent, such as an omitted hypothesis or altered quantifier scope that changes the logical content.
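A toy Lean example of this failure mode, showing how a change in quantifier scope alone alters logical content (both statements are illustrative, not benchmark entries):

```lean
import Mathlib

-- Faithful reading: N may depend on ε (true: choose N large enough).
example : ∀ ε : ℝ, 0 < ε → ∃ N : ℕ, ∀ n : ℕ, N ≤ n → (1 : ℝ) / (n + 1) < ε := by
  sorry

-- Drifted reading: a single N must work for every ε simultaneously — a
-- strictly stronger statement that is false, though it differs from the
-- one above only in the order of quantifiers.
example : ∃ N : ℕ, ∀ ε : ℝ, 0 < ε → ∀ n : ℕ, N ≤ n → (1 : ℝ) / (n + 1) < ε := by
  sorry
```

A mismatch of exactly this kind between a benchmark statement and its published source would undermine any claim built on "resolving" that entry.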
Original abstract
As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Formal Conjectures, an evolving benchmark of 2615 Lean 4 formalized mathematical statements drawn from active research areas. It includes 1029 open research conjectures intended as a zero-contamination testbed for proof discovery and 836 solved problems for autoformalization, together with a collaborative open-source interface linking mathematicians and AI systems. The manuscript describes an auditing workflow that uses AI-generated proofs and disproofs to improve formalization fidelity, provides a standardized evaluation protocol, and reports baseline results on frozen subsets while claiming that the benchmark has already enabled new mathematical discoveries, including resolutions of open conjectures.
Significance. If the formalizations faithfully preserve the original statements and the reported discoveries are independently verifiable, the benchmark would supply a much-needed, community-maintained resource for measuring progress on research-level automated reasoning. Its open, evolving design and explicit separation of open versus solved items could support reproducible evaluation and incremental community contributions, addressing a clear gap between current synthetic benchmarks and frontier mathematical problems.
major comments (2)
- The abstract asserts that the benchmark 'has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures,' yet supplies no concrete examples, proof artifacts, or independent verification details. Without side-by-side source-to-formalization comparisons or mathematician sign-off on any resolved item, the central utility claim cannot be assessed.
- The manuscript states that AI-generated proofs serve as an auditing mechanism to ensure fidelity of the 1029 open conjectures, but provides no quantification of how often such audits caught versus missed semantic drift, nor any reported error rates or expert cross-check statistics. This leaves the zero-contamination guarantee unverified.
minor comments (2)
- The total of 2615 statements is stated, yet the breakdown (1029 open + 836 solved) leaves 750 items unaccounted for; a brief table or sentence clarifying the remaining categories would improve clarity.
- Baseline results are mentioned but no numerical performance figures, model names, or evaluation metrics appear in the provided text; including at least summary statistics for the frozen subsets would strengthen the evaluation section.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the manuscript. We address the major points below and describe the changes we will make in revision.
Point-by-point responses
- Referee: The abstract asserts that the benchmark 'has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures,' yet supplies no concrete examples, proof artifacts, or independent verification details. Without side-by-side source-to-formalization comparisons or mathematician sign-off on any resolved item, the central utility claim cannot be assessed.
Authors: We agree that the manuscript would be strengthened by concrete supporting details for this claim. In the revised version we will add a dedicated subsection with at least two specific examples of resolved conjectures. Each example will include the original informal statement, the corresponding Lean formalization, the discovered proof artifact (with repository link), and any available mathematician sign-off or verification note. Side-by-side comparisons will be provided for the first example to illustrate fidelity. revision: yes
- Referee: The manuscript states that AI-generated proofs serve as an auditing mechanism to ensure fidelity of the 1029 open conjectures, but provides no quantification of how often such audits caught versus missed semantic drift, nor any reported error rates or expert cross-check statistics. This leaves the zero-contamination guarantee unverified.
Authors: We recognize that the current text lacks quantitative evaluation of the auditing process. In revision we will expand the auditing section to describe the workflow with concrete qualitative examples of semantic issues caught by AI-generated proofs or disproofs, and we will explicitly note the absence of systematic catch/miss statistics or expert cross-check rates. We will also clarify that the zero-contamination property relies on the open-source, community-audited nature of the repository rather than on quantified audit performance. revision: partial
- Quantitative statistics on how often AI audits caught versus missed semantic drift, including error rates and expert cross-check results, are not available because such systematic tracking has not been performed.
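The auditing loop discussed in this exchange can be sketched as a triage routine: run a prover against both an allegedly open statement and its negation, and route any success to human review. This is a hypothetical sketch of that workflow, not the project's actual tooling.

```python
def audit_open_statement(statement, negation, prover):
    """Triage an allegedly open formal statement.

    `prover` returns True if it finds a machine-checked proof. A proof of
    the negation usually signals formalization drift (e.g. a missing
    hypothesis made the statement trivially false); a proof of the
    statement itself needs a fidelity review against the informal source
    before it counts as a resolved conjecture.
    """
    if prover(negation):
        return "disproof found: check for missing hypotheses or wrong quantifiers"
    if prover(statement):
        return "proof found: verify fidelity against the informal source"
    return "still open"


# Toy usage: a stub prover that "disproves" one mis-formalized statement.
stub = lambda s: s == "not mis_stated_conj"
print(audit_open_statement("mis_stated_conj", "not mis_stated_conj", stub))
```

The design choice the referee is probing is visible here: the loop surfaces candidate errors but does not by itself measure how many it misses, which is why catch/miss statistics would require separate expert cross-checks.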
Circularity Check
No circularity: benchmark sourced externally with independent verification claims
Full rationale
The paper constructs Formal Conjectures from externally sourced open conjectures and solved problems contributed by the mathematical community, with no self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to tautology. The utility demonstration (new discoveries via the benchmark) is presented as an empirical outcome of external use rather than an internal derivation; formalization fidelity is asserted through an open collaborative process and AI auditing loop, but this does not equate any result to its inputs by construction. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a self-referential manner. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Lean 4 formalizations correctly and completely capture the mathematical meaning of the included conjectures and problems.