Lean-GAP: A Dataset of Formalized Graduate Algebra Problems
Pith reviewed 2026-06-30 17:29 UTC · model grok-4.3
The pith
Lean-GAP supplies 430 graduate algebra problems formalized in Lean 4 from Dummit and Foote.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lean-GAP consists of 430 exercises drawn from Abstract Algebra by Dummit and Foote, each rendered as a verified Lean 4 statement. The authors built the collection through PDF-to-LaTeX preprocessing, model-driven autoformalization, and human review of informal-formal alignment. They report that verification remains the dominant labor cost and document patterns of difficulty that arise when moving graduate algebra into a proof assistant.
What carries the argument
The Lean-GAP dataset together with its three-stage pipeline of preprocessing, autoformalization, and correspondence verification.
If this is right
- The dataset supplies training and test examples for autoformalization models aimed at graduate mathematics.
- Systematic comparison of different models becomes possible on the same set of algebra statements.
- Recurring formalization bottlenecks identified in the paper can guide development of new Lean tactics or translation aids.
- The pipeline offers a template that can be replicated on other graduate textbooks in algebra or adjacent fields.
Where Pith is reading between the lines
- Larger collections built the same way could serve as benchmarks for measuring progress in automated formalization of entire courses.
- The labor distribution reported here suggests that investment in verification assistants would yield the largest reduction in total effort.
- If the same difficulties appear in other subjects, the dataset could help prioritize which areas of mathematics are currently hardest to formalize at scale.
Load-bearing premise
Human reviewers can reliably confirm that each Lean statement captures the intended meaning of the original textbook problem.
What would settle it
Inspection of a random sample of twenty problems that reveals multiple cases where the Lean formalization alters the logical content or scope of the original exercise.
Figures
read the original abstract
We present Lean-GAP (Lean-Graduate Agebra Problems), 430 formalized graduate-level algebra problems from the textbook Abstract Algebra by Dummit and Foote. We develop a scalable pipeline consisting of PDF-to-LaTeX preprocessing, autoformalization into Lean 4, and verification of informal-formal correspondence. While the preprocessing and autoformalization stages can be largely automated, we find that verification remains the most subtle and labor-intensive component, requiring careful human oversight. Our contributions include (i) the construction of a structured dataset of formalized exercises, (ii) a systematic methodology for formalizing textbook mathematics, and (iii) an analysis of recurring challenges in the formalization process. We also compare the performance of different autoformalization models and highlight key bottlenecks in translating informal statements into formal language.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Lean-GAP, a dataset of 430 formalized graduate-level algebra problems drawn from Dummit and Foote's Abstract Algebra textbook. It describes a three-stage pipeline (PDF-to-LaTeX preprocessing, autoformalization into Lean 4, and human verification of informal-formal correspondence), a systematic methodology for textbook formalization, an analysis of recurring challenges, and a comparison of autoformalization model performance.
Significance. A verified, publicly released dataset of this size and level would be a useful benchmark resource for autoformalization research and for studying the gap between informal textbook statements and formal proofs. The explicit identification of verification as the dominant remaining bottleneck is a constructive methodological observation. The absence of any quantitative verification statistics, however, prevents a clear assessment of the dataset's reliability or of the pipeline's overall effectiveness.
major comments (1)
- [Abstract] Abstract: the claim that the 430 problems have been 'verified' is presented without any accompanying quantitative data on verification success rate, error types, inter-annotator agreement, or the fraction of problems that required substantial human correction. These metrics are load-bearing for the central contribution of a reliable formalized dataset.
minor comments (2)
- [Abstract] Abstract, title: 'Lean-Graduate Agebra Problems' contains a typographical error ('Agebra' for 'Algebra').
- [Abstract] The manuscript states that model comparisons are performed, yet the abstract supplies no numerical results or table references; if such results exist they should be summarized with at least one concrete performance figure.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the need for quantitative verification metrics. We agree this is a substantive point and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the 430 problems have been 'verified' is presented without any accompanying quantitative data on verification success rate, error types, inter-annotator agreement, or the fraction of problems that required substantial human correction. These metrics are load-bearing for the central contribution of a reliable formalized dataset.
Authors: We agree that the abstract would benefit from more precise language and that quantitative metrics on verification would strengthen the contribution. In the current work the verification step consisted of author-led manual inspection to confirm informal-formal correspondence for each of the 430 problems; however, the process was iterative and performed within a small team, so inter-annotator agreement was not computed and per-problem correction statistics were not systematically logged. We will revise the abstract to replace the unqualified claim of verification with a description of the pipeline that includes human review, and we will add a short paragraph in the methods section describing the verification procedure and its limitations without introducing new fabricated numbers. This addresses the concern while remaining faithful to the data we collected. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper is a dataset construction effort describing a pipeline for formalizing graduate algebra problems from Dummit and Foote into Lean 4. It reports on preprocessing, autoformalization, and human verification steps without any claimed mathematical derivations, predictions from fitted parameters, or first-principles results. No equations, self-citations as load-bearing premises, or renamings of known results appear. The central claim (release of 430 verified formalizations) is supported by explicit methodological description and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Graduate algebra problems from Dummit and Foote can be formalized in Lean 4 while preserving their mathematical content.
Reference graph
Works this paper leans on
-
[1]
Aristotle: IMO-level Automated Theorem Proving
Achim, T., Best, A., Bietti, A., Der, K., Fédérico, M., Gukov, S., Halpern-Leistner, D., Hen- ningsgard, K., Kudryashov, Y ., Meiburg, A., Michelsen, M., Patterson, R., Rodriguez, E., Scharff, L., Shanker, V ., Sicca, V ., Sowrirajan, H., Swope, A., Tamas, M., Tenev, V ., Thomm, J., Williams, H., and Wu, L. Aristotle: IMO-level automated theorem proving, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Ayers, Dragomir Radev, and Jeremy Avigad
Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics, 2023. URL https: //arxiv.org/abs/2302.12433
-
[3]
Fermat’s Last Theorem formalization project
Buzzard, K. Fermat’s Last Theorem formalization project. URL https:// imperialcollegelondon.github.io/FLT/
-
[4]
Dummit, D. S. and Foote, R. M.Abstract Algebra. John Wiley & Sons, 3rd edition, 2003
2003
-
[5]
Herald: A natural language annotated Lean 4 dataset
Gao, G., Wang, Y ., Jiang, J., Gao, Q., Qin, Z., Xu, T., and Dong, B. Herald: A natural language annotated Lean 4 dataset. InThe 13th International Conference on Learning Representations,
-
[6]
URLhttps://openreview.net/forum?id=Se6MgCtRhz
-
[7]
Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V ., and Wildon, M. FrontierMath: A benchmark for eval...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Automatic Textbook Formalization
Gloeckle, F., Rammal, A., Arnal, C., Munos, R., Cabannes, V ., Synnaeve, G., and Hayat, A. Automatic textbook formalization, 2026. URLhttps://arxiv.org/abs/2604.03071
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
Hubert, T., Mehta, R., Sartran, L., Horváth, M. Z., Žuži ´c, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y ., Masoom, H., Bertolli, O., Zahavy, T., Mandhane, A., Yung, J., Beloshapka, I., Ibarz, B., Veeriah, V ., Yu, L., Nash, O., Lezeau, P., Mercuri, S., Sönne, C., Mehta, B., Davies, A., Zheng, D., Pedregosa, F., Li, Y ., von Glehn, I., Row...
-
[10]
FATE: A formal benchmark series for frontier algebra of multiple difficulty levels
Jiang, J., He, W., Yuefeng, W., Gao, G., Hu, Y ., Wang, J., Guan, N., Wu, P., Dai, B., Xiao, L., and Dong, B. FATE: A formal benchmark series for frontier algebra of multiple difficulty levels. InThe 14th International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=3bD19r4jqh. 10
2026
-
[11]
Mathlib4: The Lean mathematical library
Lean Prover Community. Mathlib4: The Lean mathematical library. URL https://github. com/leanprover-community/mathlib4
-
[12]
Solving quantitative reasoning problems with language models
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V ., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y ., Neyshabur, B., Gur-Ari, G., and Misra, V . Solving quantitative reasoning problems with language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, 2022. URL https:...
2022
-
[13]
Autoformalize mathematical statements by symbolic equivalence and semantic consistency
Li, Z., Wu, Y ., Li, Z., Wei, X., Yang, F., Zhang, X., and Ma, X. Autoformalize mathematical statements by symbolic equivalence and semantic consistency. InProceedings of the 38th International Conference on Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=8ihVBYpMV4
2024
-
[14]
FIMO: A challenge formal dataset for automated theorem proving, 2023
Liu, C., Shen, J., Xin, H., Liu, Z., Yuan, Y ., Wang, H., Ju, W., Zheng, C., Yin, Y ., Li, L., Zhang, M., and Liu, Q. FIMO: A challenge formal dataset for automated theorem proving, 2023. URL https://arxiv.org/abs/2309.04295
-
[15]
Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach
Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. InThe 13th International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id= hUb2At2DsQ
2025
-
[16]
Beyond theorem proving: Formulation, framework and benchmark for formal problem-solving, 2025
Liu, Q., Zheng, X., Xia, R., Qi, X., Cao, Q., and Yan, J. Beyond theorem proving: Formulation, framework and benchmark for formal problem-solving, 2025. URL https://arxiv.org/abs/ 2505.04528
-
[17]
ASSESS: A semantic and structural evaluation framework for statement similarity
Liu, X., Zhu, T., Dong, Z., Liu, Y ., qingfeng, G., ZhaoXuan, L., Chen, Y ., and Luo, T. ASSESS: A semantic and structural evaluation framework for statement similarity. InThe 14th International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=avwNGWtiHF
2026
-
[18]
Process-driven autoformalization in Lean 4, 2024
Lu, J., Wan, Y ., Liu, Z., Huang, Y ., Xiong, J., Liu, C., Shen, J., Jin, H., Zhang, J., Wang, H., Yang, Z., Tang, J., and Guo, Z. Process-driven autoformalization in Lean 4, 2024. URL https://doi.org/10.48550/arXiv.2406.01940
-
[19]
StrongPNT: Formalizing the Prime Number Theorem
Math Inc. StrongPNT: Formalizing the Prime Number Theorem. https://math-inc.github. io/strongpnt/
-
[20]
FormalQualBench.https://github.com/math-inc/FormalQualBench, 2026
Math Inc. FormalQualBench.https://github.com/math-inc/FormalQualBench, 2026
2026
-
[21]
Autoformalizing Euclidean geometry
Murphy, L., Yang, K., Sun, J., Li, Z., Anandkumar, A., and Si, X. Autoformalizing Euclidean geometry. InProceedings of the 41st International Conference on Machine Learning, 2024. URLhttps://dl.acm.org/doi/10.5555/3692070.3693567
-
[22]
Reliable evaluation and benchmarks for statement autoformalization
Poiroux, A., Weiss, G., Kunˇcak, V ., and Bosselut, A. Reliable evaluation and benchmarks for statement autoformalization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17947–17969, 2025. URL https://aclanthology.org/ 2025.emnlp-main.907/
2025
-
[23]
Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y ., Gao, W., Guo, D., and Ruan, C. DeepSeek- Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition, 2025. URLhttps://arxiv.org/abs/2504.21801
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Liquid tensor experiment, 2020
Scholze, P. Liquid tensor experiment, 2020. Available athttps://xenaproject.wordpress. com/2020/12/05/liquid-tensor-experiment/
2020
-
[25]
REAL-Prover: Retrieval augmented Lean prover for mathematical reasoning, 2025
Shen, Z., Huang, N., Yang, F., Wang, Y ., Gao, G., Xu, T., Jiang, J., He, W., Yang, P., Sun, M., Ju, H., Wu, P., Dai, B., and Dong, B. REAL-Prover: Retrieval augmented Lean prover for mathematical reasoning, 2025. URLhttps://arxiv.org/abs/2505.20613. 11
-
[26]
Taylor, A. K., Zhang, J., Ji, E., Sahai, V ., Deng, H., Chen, Y ., Yuan, Y ., Wu, D., Gu, J.-C., Chang, K.-W., Peng, N., Sahai, A., and Wang, W. TaoBench: Do automated theorem prover LLMs generalize beyond Mathlib?, 2026. URLhttps://arxiv.org/abs/2603.12744
-
[27]
PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition
Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., and Chaudhuri, S. PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition. InProceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum?id=ChKCF75Ocd
2024
-
[28]
M2F: Automated formalization of mathematical literature at scale, 2026
Wang, Z., Ma, W., Ming, Z., Zhang, G., Yuan, K., and Wen, Z. M2F: Automated formalization of mathematical literature at scale, 2026. URLhttps://arxiv.org/abs/2602.17016
-
[29]
Advancing theorem proving in LLMs through large-scale synthetic data
Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., and Liang, X. Advancing theorem proving in LLMs through large-scale synthetic data. InThe 4th Workshop on Mathematical Reasoning and AI at NeurIPS’24, 2024. URL https://openreview.net/ forum?id=TPtXLihkny
2024
-
[30]
Lean Workbook: A large-scale Lean problem set formalized from natural language math problems
Ying, H., Wu, Z., Geng, Y ., Wang, J., Lin, D., and Chen, K. Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. InProceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum?id=Vcw3vzjHDb
2024
-
[31]
FormalMATH: Benchmarking formal mathematical reasoning of large language models, 2025
Yu, Z., Peng, R., Ding, K., Li, Y ., Peng, Z., Liu, M., Zhang, Y ., Yuan, Z., Xin, H., Huang, W., Wen, Y ., Zhang, G., and Liu, W. FormalMATH: Benchmarking formal mathematical reasoning of large language models, 2025. URLhttps://arxiv.org/abs/2505.02735
-
[32]
ifXthenY
Zheng, K., Han, J. M., and Polu, S. miniF2F: a cross-system benchmark for formal Olympiad- level mathematics. InInternational Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv. A Appendix A.1 Suspicious-statement patterns Successful Lean elaboration is a necessary but not a sufficient condition for a correct aut...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.