Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

Byung-Hak Hwang; Chul-hee Lee; Hyojae Lim; Hyukpyo Hong; Hyungryul Baik; Ilkyoo Choi; Jaeseong Heo; Jihoon Hyun; Jineon Baek; Keewoo Lee

arxiv: 2606.02588 · v1 · pith:GSE6ZYYPnew · submitted 2026-05-20 · 💻 cs.LO · cs.AI· cs.PL

Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

Seewoo Lee , Byung-Hak Hwang , Hyojae Lim , Jihoon Hyun , Ilkyoo Choi , Yeachan Park , Jineon Baek , Hyukpyo Hong

show 5 more authors

Keewoo Lee Jaeseong Heo Hyungryul Baik Chul-hee Lee Kyu-Hwan Lee

This is my paper

Pith reviewed 2026-06-30 17:29 UTC · model grok-4.3

classification 💻 cs.LO cs.AIcs.PL

keywords Lean-GAPformalized problemsLean 4autoformalizationDummit and Footegraduate algebratheorem provingformal verification

0 comments

The pith

Lean-GAP supplies 430 graduate algebra problems formalized in Lean 4 from Dummit and Foote.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a dataset of 430 graduate-level algebra problems translated into Lean 4. It describes a pipeline that converts textbook PDFs into LaTeX, applies autoformalization, and then checks that each formal statement matches its informal source. The work shows that early pipeline stages can run with little human input while the final verification step demands sustained oversight. A reader would care because the dataset supplies concrete material for testing whether current formal systems can handle material beyond introductory courses. The analysis of recurring translation difficulties also points to where further tooling improvements are needed.

Core claim

Lean-GAP consists of 430 exercises drawn from Abstract Algebra by Dummit and Foote, each rendered as a verified Lean 4 statement. The authors built the collection through PDF-to-LaTeX preprocessing, model-driven autoformalization, and human review of informal-formal alignment. They report that verification remains the dominant labor cost and document patterns of difficulty that arise when moving graduate algebra into a proof assistant.

What carries the argument

The Lean-GAP dataset together with its three-stage pipeline of preprocessing, autoformalization, and correspondence verification.

If this is right

The dataset supplies training and test examples for autoformalization models aimed at graduate mathematics.
Systematic comparison of different models becomes possible on the same set of algebra statements.
Recurring formalization bottlenecks identified in the paper can guide development of new Lean tactics or translation aids.
The pipeline offers a template that can be replicated on other graduate textbooks in algebra or adjacent fields.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Larger collections built the same way could serve as benchmarks for measuring progress in automated formalization of entire courses.
The labor distribution reported here suggests that investment in verification assistants would yield the largest reduction in total effort.
If the same difficulties appear in other subjects, the dataset could help prioritize which areas of mathematics are currently hardest to formalize at scale.

Load-bearing premise

Human reviewers can reliably confirm that each Lean statement captures the intended meaning of the original textbook problem.

What would settle it

Inspection of a random sample of twenty problems that reveals multiple cases where the Lean formalization alters the logical content or scope of the original exercise.

Figures

Figures reproduced from arXiv: 2606.02588 by Byung-Hak Hwang, Chul-hee Lee, Hyojae Lim, Hyukpyo Hong, Hyungryul Baik, Ilkyoo Choi, Jaeseong Heo, Jihoon Hyun, Jineon Baek, Keewoo Lee, Kyu-Hwan Lee, Seewoo Lee, Yeachan Park.

**Figure 2.** Figure 2: Subgroup lattice for Exercise 11, Chapter 2.5 [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗

read the original abstract

We present Lean-GAP (Lean-Graduate Agebra Problems), 430 formalized graduate-level algebra problems from the textbook Abstract Algebra by Dummit and Foote. We develop a scalable pipeline consisting of PDF-to-LaTeX preprocessing, autoformalization into Lean 4, and verification of informal-formal correspondence. While the preprocessing and autoformalization stages can be largely automated, we find that verification remains the most subtle and labor-intensive component, requiring careful human oversight. Our contributions include (i) the construction of a structured dataset of formalized exercises, (ii) a systematic methodology for formalizing textbook mathematics, and (iii) an analysis of recurring challenges in the formalization process. We also compare the performance of different autoformalization models and highlight key bottlenecks in translating informal statements into formal language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lean-GAP ships 430 graduate algebra problems in Lean 4 with a pipeline that still leans heavily on human verification.

read the letter

The core contribution is a new dataset of 430 formalized exercises pulled from Dummit and Foote, done in Lean 4. They lay out a three-part pipeline—PDF cleanup, autoformalization, and human checking of the informal-to-formal match—and they run some model comparisons plus notes on common sticking points like definitions and notation.

This is genuinely new at the graduate level. Most prior formal math datasets stop at undergrad or contest problems, so having this scale of material from a standard algebra text is a step up. The authors are straightforward that verification is the part that resists automation and needs careful oversight, which matches what people actually see when they try this work.

The soft spot is the missing numbers. The abstract and description give no success rates on the autoformalization stage, no breakdown of error types, and no sense of how many hours the human verification took per problem. Without those details it is hard to judge dataset quality or whether the pipeline is practical to repeat.

The paper is for people working on autoformalization tools, Lean benchmarks, or AI-for-math datasets. A reader who wants higher-level formal problems or who is building on existing Lean resources will get direct use from the release itself.

It is worth sending to peer review. The dataset is new and the level is higher than before, even if the quantitative evaluation of the process needs strengthening.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Lean-GAP, a dataset of 430 formalized graduate-level algebra problems drawn from Dummit and Foote's Abstract Algebra textbook. It describes a three-stage pipeline (PDF-to-LaTeX preprocessing, autoformalization into Lean 4, and human verification of informal-formal correspondence), a systematic methodology for textbook formalization, an analysis of recurring challenges, and a comparison of autoformalization model performance.

Significance. A verified, publicly released dataset of this size and level would be a useful benchmark resource for autoformalization research and for studying the gap between informal textbook statements and formal proofs. The explicit identification of verification as the dominant remaining bottleneck is a constructive methodological observation. The absence of any quantitative verification statistics, however, prevents a clear assessment of the dataset's reliability or of the pipeline's overall effectiveness.

major comments (1)

[Abstract] Abstract: the claim that the 430 problems have been 'verified' is presented without any accompanying quantitative data on verification success rate, error types, inter-annotator agreement, or the fraction of problems that required substantial human correction. These metrics are load-bearing for the central contribution of a reliable formalized dataset.

minor comments (2)

[Abstract] Abstract, title: 'Lean-Graduate Agebra Problems' contains a typographical error ('Agebra' for 'Algebra').
[Abstract] The manuscript states that model comparisons are performed, yet the abstract supplies no numerical results or table references; if such results exist they should be summarized with at least one concrete performance figure.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the need for quantitative verification metrics. We agree this is a substantive point and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the 430 problems have been 'verified' is presented without any accompanying quantitative data on verification success rate, error types, inter-annotator agreement, or the fraction of problems that required substantial human correction. These metrics are load-bearing for the central contribution of a reliable formalized dataset.

Authors: We agree that the abstract would benefit from more precise language and that quantitative metrics on verification would strengthen the contribution. In the current work the verification step consisted of author-led manual inspection to confirm informal-formal correspondence for each of the 430 problems; however, the process was iterative and performed within a small team, so inter-annotator agreement was not computed and per-problem correction statistics were not systematically logged. We will revise the abstract to replace the unqualified claim of verification with a description of the pipeline that includes human review, and we will add a short paragraph in the methods section describing the verification procedure and its limitations without introducing new fabricated numbers. This addresses the concern while remaining faithful to the data we collected. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a dataset construction effort describing a pipeline for formalizing graduate algebra problems from Dummit and Foote into Lean 4. It reports on preprocessing, autoformalization, and human verification steps without any claimed mathematical derivations, predictions from fitted parameters, or first-principles results. No equations, self-citations as load-bearing premises, or renamings of known results appear. The central claim (release of 430 verified formalizations) is supported by explicit methodological description and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that human verification can establish reliable informal-formal correspondence for the chosen problems.

axioms (1)

domain assumption Graduate algebra problems from Dummit and Foote can be formalized in Lean 4 while preserving their mathematical content.
This assumption is required for the dataset to be usable as a benchmark.

pith-pipeline@v0.9.1-grok · 5725 in / 1155 out tokens · 18840 ms · 2026-06-30T17:29:29.011446+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 14 canonical work pages · 4 internal anchors

[1]

Aristotle: IMO-level Automated Theorem Proving

Achim, T., Best, A., Bietti, A., Der, K., Fédérico, M., Gukov, S., Halpern-Leistner, D., Hen- ningsgard, K., Kudryashov, Y ., Meiburg, A., Michelsen, M., Patterson, R., Rodriguez, E., Scharff, L., Shanker, V ., Sicca, V ., Sowrirajan, H., Swope, A., Tamas, M., Tenev, V ., Thomm, J., Williams, H., and Wu, L. Aristotle: IMO-level automated theorem proving, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Ayers, Dragomir Radev, and Jeremy Avigad

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics, 2023. URL https: //arxiv.org/abs/2302.12433

work page arXiv 2023
[3]

Fermat’s Last Theorem formalization project

Buzzard, K. Fermat’s Last Theorem formalization project. URL https:// imperialcollegelondon.github.io/FLT/
[4]

Dummit, D. S. and Foote, R. M.Abstract Algebra. John Wiley & Sons, 3rd edition, 2003

2003
[5]

Herald: A natural language annotated Lean 4 dataset

Gao, G., Wang, Y ., Jiang, J., Gao, Q., Qin, Z., Xu, T., and Dong, B. Herald: A natural language annotated Lean 4 dataset. InThe 13th International Conference on Learning Representations,
[6]

URLhttps://openreview.net/forum?id=Se6MgCtRhz
[7]

Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V ., and Wildon, M. FrontierMath: A benchmark for eval...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Automatic Textbook Formalization

Gloeckle, F., Rammal, A., Arnal, C., Munos, R., Cabannes, V ., Synnaeve, G., and Hayat, A. Automatic textbook formalization, 2026. URLhttps://arxiv.org/abs/2604.03071

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Hubert, T., Mehta, R., Sartran, L., Horváth, M. Z., Žuži ´c, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y ., Masoom, H., Bertolli, O., Zahavy, T., Mandhane, A., Yung, J., Beloshapka, I., Ibarz, B., Veeriah, V ., Yu, L., Nash, O., Lezeau, P., Mercuri, S., Sönne, C., Mehta, B., Davies, A., Zheng, D., Pedregosa, F., Li, Y ., von Glehn, I., Row...

work page doi:10.1038/s41586-025-09833-y 2026
[10]

FATE: A formal benchmark series for frontier algebra of multiple difficulty levels

Jiang, J., He, W., Yuefeng, W., Gao, G., Hu, Y ., Wang, J., Guan, N., Wu, P., Dai, B., Xiao, L., and Dong, B. FATE: A formal benchmark series for frontier algebra of multiple difficulty levels. InThe 14th International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=3bD19r4jqh. 10

2026
[11]

Mathlib4: The Lean mathematical library

Lean Prover Community. Mathlib4: The Lean mathematical library. URL https://github. com/leanprover-community/mathlib4
[12]

Solving quantitative reasoning problems with language models

Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V ., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y ., Neyshabur, B., Gur-Ari, G., and Misra, V . Solving quantitative reasoning problems with language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, 2022. URL https:...

2022
[13]

Autoformalize mathematical statements by symbolic equivalence and semantic consistency

Li, Z., Wu, Y ., Li, Z., Wei, X., Yang, F., Zhang, X., and Ma, X. Autoformalize mathematical statements by symbolic equivalence and semantic consistency. InProceedings of the 38th International Conference on Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=8ihVBYpMV4

2024
[14]

FIMO: A challenge formal dataset for automated theorem proving, 2023

Liu, C., Shen, J., Xin, H., Liu, Z., Yuan, Y ., Wang, H., Ju, W., Zheng, C., Yin, Y ., Li, L., Zhang, M., and Liu, Q. FIMO: A challenge formal dataset for automated theorem proving, 2023. URL https://arxiv.org/abs/2309.04295

work page arXiv 2023
[15]

Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach

Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. InThe 13th International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= hUb2At2DsQ

2025
[16]

Beyond theorem proving: Formulation, framework and benchmark for formal problem-solving, 2025

Liu, Q., Zheng, X., Xia, R., Qi, X., Cao, Q., and Yan, J. Beyond theorem proving: Formulation, framework and benchmark for formal problem-solving, 2025. URL https://arxiv.org/abs/ 2505.04528

work page arXiv 2025
[17]

ASSESS: A semantic and structural evaluation framework for statement similarity

Liu, X., Zhu, T., Dong, Z., Liu, Y ., qingfeng, G., ZhaoXuan, L., Chen, Y ., and Luo, T. ASSESS: A semantic and structural evaluation framework for statement similarity. InThe 14th International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=avwNGWtiHF

2026
[18]

Process-driven autoformalization in Lean 4, 2024

Lu, J., Wan, Y ., Liu, Z., Huang, Y ., Xiong, J., Liu, C., Shen, J., Jin, H., Zhang, J., Wang, H., Yang, Z., Tang, J., and Guo, Z. Process-driven autoformalization in Lean 4, 2024. URL https://doi.org/10.48550/arXiv.2406.01940

work page doi:10.48550/arxiv.2406.01940 2024
[19]

StrongPNT: Formalizing the Prime Number Theorem

Math Inc. StrongPNT: Formalizing the Prime Number Theorem. https://math-inc.github. io/strongpnt/
[20]

FormalQualBench.https://github.com/math-inc/FormalQualBench, 2026

Math Inc. FormalQualBench.https://github.com/math-inc/FormalQualBench, 2026

2026
[21]

Autoformalizing Euclidean geometry

Murphy, L., Yang, K., Sun, J., Li, Z., Anandkumar, A., and Si, X. Autoformalizing Euclidean geometry. InProceedings of the 41st International Conference on Machine Learning, 2024. URLhttps://dl.acm.org/doi/10.5555/3692070.3693567

work page doi:10.5555/3692070.3693567 2024
[22]

Reliable evaluation and benchmarks for statement autoformalization

Poiroux, A., Weiss, G., Kunˇcak, V ., and Bosselut, A. Reliable evaluation and benchmarks for statement autoformalization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17947–17969, 2025. URL https://aclanthology.org/ 2025.emnlp-main.907/

2025
[23]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y ., Gao, W., Guo, D., and Ruan, C. DeepSeek- Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition, 2025. URLhttps://arxiv.org/abs/2504.21801

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Liquid tensor experiment, 2020

Scholze, P. Liquid tensor experiment, 2020. Available athttps://xenaproject.wordpress. com/2020/12/05/liquid-tensor-experiment/

2020
[25]

REAL-Prover: Retrieval augmented Lean prover for mathematical reasoning, 2025

Shen, Z., Huang, N., Yang, F., Wang, Y ., Gao, G., Xu, T., Jiang, J., He, W., Yang, P., Sun, M., Ju, H., Wu, P., Dai, B., and Dong, B. REAL-Prover: Retrieval augmented Lean prover for mathematical reasoning, 2025. URLhttps://arxiv.org/abs/2505.20613. 11

work page arXiv 2025
[26]

K., Zhang, J., Ji, E., Sahai, V ., Deng, H., Chen, Y ., Yuan, Y ., Wu, D., Gu, J.-C., Chang, K.-W., Peng, N., Sahai, A., and Wang, W

Taylor, A. K., Zhang, J., Ji, E., Sahai, V ., Deng, H., Chen, Y ., Yuan, Y ., Wu, D., Gu, J.-C., Chang, K.-W., Peng, N., Sahai, A., and Wang, W. TaoBench: Do automated theorem prover LLMs generalize beyond Mathlib?, 2026. URLhttps://arxiv.org/abs/2603.12744

work page arXiv 2026
[27]

PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition

Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., and Chaudhuri, S. PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition. InProceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum?id=ChKCF75Ocd

2024
[28]

M2F: Automated formalization of mathematical literature at scale, 2026

Wang, Z., Ma, W., Ming, Z., Zhang, G., Yuan, K., and Wen, Z. M2F: Automated formalization of mathematical literature at scale, 2026. URLhttps://arxiv.org/abs/2602.17016

work page arXiv 2026
[29]

Advancing theorem proving in LLMs through large-scale synthetic data

Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., and Liang, X. Advancing theorem proving in LLMs through large-scale synthetic data. InThe 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, 2024. URL https://openreview.net/ forum?id=TPtXLihkny

2024
[30]

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Ying, H., Wu, Z., Geng, Y ., Wang, J., Lin, D., and Chen, K. Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. InProceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum?id=Vcw3vzjHDb

2024
[31]

FormalMATH: Benchmarking formal mathematical reasoning of large language models, 2025

Yu, Z., Peng, R., Ding, K., Li, Y ., Peng, Z., Liu, M., Zhang, Y ., Yuan, Z., Xin, H., Huang, W., Wen, Y ., Zhang, G., and Liu, W. FormalMATH: Benchmarking formal mathematical reasoning of large language models, 2025. URLhttps://arxiv.org/abs/2505.02735

work page arXiv 2025
[32]

ifXthenY

Zheng, K., Han, J. M., and Polu, S. miniF2F: a cross-system benchmark for formal Olympiad- level mathematics. InInternational Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv. A Appendix A.1 Suspicious-statement patterns Successful Lean elaboration is a necessary but not a sufficient condition for a correct aut...

2022

[1] [1]

Aristotle: IMO-level Automated Theorem Proving

Achim, T., Best, A., Bietti, A., Der, K., Fédérico, M., Gukov, S., Halpern-Leistner, D., Hen- ningsgard, K., Kudryashov, Y ., Meiburg, A., Michelsen, M., Patterson, R., Rodriguez, E., Scharff, L., Shanker, V ., Sicca, V ., Sowrirajan, H., Swope, A., Tamas, M., Tenev, V ., Thomm, J., Williams, H., and Wu, L. Aristotle: IMO-level automated theorem proving, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Ayers, Dragomir Radev, and Jeremy Avigad

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics, 2023. URL https: //arxiv.org/abs/2302.12433

work page arXiv 2023

[3] [3]

Fermat’s Last Theorem formalization project

Buzzard, K. Fermat’s Last Theorem formalization project. URL https:// imperialcollegelondon.github.io/FLT/

[4] [4]

Dummit, D. S. and Foote, R. M.Abstract Algebra. John Wiley & Sons, 3rd edition, 2003

2003

[5] [5]

Herald: A natural language annotated Lean 4 dataset

Gao, G., Wang, Y ., Jiang, J., Gao, Q., Qin, Z., Xu, T., and Dong, B. Herald: A natural language annotated Lean 4 dataset. InThe 13th International Conference on Learning Representations,

[6] [6]

URLhttps://openreview.net/forum?id=Se6MgCtRhz

[7] [7]

Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V ., and Wildon, M. FrontierMath: A benchmark for eval...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Automatic Textbook Formalization

Gloeckle, F., Rammal, A., Arnal, C., Munos, R., Cabannes, V ., Synnaeve, G., and Hayat, A. Automatic textbook formalization, 2026. URLhttps://arxiv.org/abs/2604.03071

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Hubert, T., Mehta, R., Sartran, L., Horváth, M. Z., Žuži ´c, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y ., Masoom, H., Bertolli, O., Zahavy, T., Mandhane, A., Yung, J., Beloshapka, I., Ibarz, B., Veeriah, V ., Yu, L., Nash, O., Lezeau, P., Mercuri, S., Sönne, C., Mehta, B., Davies, A., Zheng, D., Pedregosa, F., Li, Y ., von Glehn, I., Row...

work page doi:10.1038/s41586-025-09833-y 2026

[10] [10]

FATE: A formal benchmark series for frontier algebra of multiple difficulty levels

Jiang, J., He, W., Yuefeng, W., Gao, G., Hu, Y ., Wang, J., Guan, N., Wu, P., Dai, B., Xiao, L., and Dong, B. FATE: A formal benchmark series for frontier algebra of multiple difficulty levels. InThe 14th International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=3bD19r4jqh. 10

2026

[11] [11]

Mathlib4: The Lean mathematical library

Lean Prover Community. Mathlib4: The Lean mathematical library. URL https://github. com/leanprover-community/mathlib4

[12] [12]

Solving quantitative reasoning problems with language models

Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V ., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y ., Neyshabur, B., Gur-Ari, G., and Misra, V . Solving quantitative reasoning problems with language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, 2022. URL https:...

2022

[13] [13]

Autoformalize mathematical statements by symbolic equivalence and semantic consistency

Li, Z., Wu, Y ., Li, Z., Wei, X., Yang, F., Zhang, X., and Ma, X. Autoformalize mathematical statements by symbolic equivalence and semantic consistency. InProceedings of the 38th International Conference on Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=8ihVBYpMV4

2024

[14] [14]

FIMO: A challenge formal dataset for automated theorem proving, 2023

Liu, C., Shen, J., Xin, H., Liu, Z., Yuan, Y ., Wang, H., Ju, W., Zheng, C., Yin, Y ., Li, L., Zhang, M., and Liu, Q. FIMO: A challenge formal dataset for automated theorem proving, 2023. URL https://arxiv.org/abs/2309.04295

work page arXiv 2023

[15] [15]

Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach

Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. InThe 13th International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= hUb2At2DsQ

2025

[16] [16]

Beyond theorem proving: Formulation, framework and benchmark for formal problem-solving, 2025

Liu, Q., Zheng, X., Xia, R., Qi, X., Cao, Q., and Yan, J. Beyond theorem proving: Formulation, framework and benchmark for formal problem-solving, 2025. URL https://arxiv.org/abs/ 2505.04528

work page arXiv 2025

[17] [17]

ASSESS: A semantic and structural evaluation framework for statement similarity

Liu, X., Zhu, T., Dong, Z., Liu, Y ., qingfeng, G., ZhaoXuan, L., Chen, Y ., and Luo, T. ASSESS: A semantic and structural evaluation framework for statement similarity. InThe 14th International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=avwNGWtiHF

2026

[18] [18]

Process-driven autoformalization in Lean 4, 2024

Lu, J., Wan, Y ., Liu, Z., Huang, Y ., Xiong, J., Liu, C., Shen, J., Jin, H., Zhang, J., Wang, H., Yang, Z., Tang, J., and Guo, Z. Process-driven autoformalization in Lean 4, 2024. URL https://doi.org/10.48550/arXiv.2406.01940

work page doi:10.48550/arxiv.2406.01940 2024

[19] [19]

StrongPNT: Formalizing the Prime Number Theorem

Math Inc. StrongPNT: Formalizing the Prime Number Theorem. https://math-inc.github. io/strongpnt/

[20] [20]

FormalQualBench.https://github.com/math-inc/FormalQualBench, 2026

Math Inc. FormalQualBench.https://github.com/math-inc/FormalQualBench, 2026

2026

[21] [21]

Autoformalizing Euclidean geometry

Murphy, L., Yang, K., Sun, J., Li, Z., Anandkumar, A., and Si, X. Autoformalizing Euclidean geometry. InProceedings of the 41st International Conference on Machine Learning, 2024. URLhttps://dl.acm.org/doi/10.5555/3692070.3693567

work page doi:10.5555/3692070.3693567 2024

[22] [22]

Reliable evaluation and benchmarks for statement autoformalization

Poiroux, A., Weiss, G., Kunˇcak, V ., and Bosselut, A. Reliable evaluation and benchmarks for statement autoformalization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17947–17969, 2025. URL https://aclanthology.org/ 2025.emnlp-main.907/

2025

[23] [23]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y ., Gao, W., Guo, D., and Ruan, C. DeepSeek- Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition, 2025. URLhttps://arxiv.org/abs/2504.21801

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Liquid tensor experiment, 2020

Scholze, P. Liquid tensor experiment, 2020. Available athttps://xenaproject.wordpress. com/2020/12/05/liquid-tensor-experiment/

2020

[25] [25]

REAL-Prover: Retrieval augmented Lean prover for mathematical reasoning, 2025

Shen, Z., Huang, N., Yang, F., Wang, Y ., Gao, G., Xu, T., Jiang, J., He, W., Yang, P., Sun, M., Ju, H., Wu, P., Dai, B., and Dong, B. REAL-Prover: Retrieval augmented Lean prover for mathematical reasoning, 2025. URLhttps://arxiv.org/abs/2505.20613. 11

work page arXiv 2025

[26] [26]

K., Zhang, J., Ji, E., Sahai, V ., Deng, H., Chen, Y ., Yuan, Y ., Wu, D., Gu, J.-C., Chang, K.-W., Peng, N., Sahai, A., and Wang, W

Taylor, A. K., Zhang, J., Ji, E., Sahai, V ., Deng, H., Chen, Y ., Yuan, Y ., Wu, D., Gu, J.-C., Chang, K.-W., Peng, N., Sahai, A., and Wang, W. TaoBench: Do automated theorem prover LLMs generalize beyond Mathlib?, 2026. URLhttps://arxiv.org/abs/2603.12744

work page arXiv 2026

[27] [27]

PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition

Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., and Chaudhuri, S. PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition. InProceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum?id=ChKCF75Ocd

2024

[28] [28]

M2F: Automated formalization of mathematical literature at scale, 2026

Wang, Z., Ma, W., Ming, Z., Zhang, G., Yuan, K., and Wen, Z. M2F: Automated formalization of mathematical literature at scale, 2026. URLhttps://arxiv.org/abs/2602.17016

work page arXiv 2026

[29] [29]

Advancing theorem proving in LLMs through large-scale synthetic data

Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., and Liang, X. Advancing theorem proving in LLMs through large-scale synthetic data. InThe 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, 2024. URL https://openreview.net/ forum?id=TPtXLihkny

2024

[30] [30]

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Ying, H., Wu, Z., Geng, Y ., Wang, J., Lin, D., and Chen, K. Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. InProceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum?id=Vcw3vzjHDb

2024

[31] [31]

FormalMATH: Benchmarking formal mathematical reasoning of large language models, 2025

Yu, Z., Peng, R., Ding, K., Li, Y ., Peng, Z., Liu, M., Zhang, Y ., Yuan, Z., Xin, H., Huang, W., Wen, Y ., Zhang, G., and Liu, W. FormalMATH: Benchmarking formal mathematical reasoning of large language models, 2025. URLhttps://arxiv.org/abs/2505.02735

work page arXiv 2025

[32] [32]

ifXthenY

Zheng, K., Han, J. M., and Polu, S. miniF2F: a cross-system benchmark for formal Olympiad- level mathematics. InInternational Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv. A Appendix A.1 Suspicious-statement patterns Successful Lean elaboration is a necessary but not a sufficient condition for a correct aut...

2022