MLFMF: Data Sets for Machine Learning for Mathematical Formalization

Andrej Bauer; Ljup\v{c}o Todorovski; Matej Petkovi\'c

arxiv: 2310.16005 · v1 · pith:CQKN3BEJnew · submitted 2023-10-24 · 💻 cs.LG

MLFMF: Data Sets for Machine Learning for Mathematical Formalization

Andrej Bauer , Matej Petkovi\'c , Ljup\v{c}o Todorovski This is my paper

classification 💻 cs.LG

keywords librarydataentriesmathematicsassistantscollectionformalizedlargest

0 comments

read the original abstract

We introduce MLFMF, a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean~4 library Mathlib, and some of the largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of s-expressions representing the syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the s-expressions give complete and easily parsed information about every entry. We report baseline results using standard graph and word embeddings, tree ensembles, and instance-based learning algorithms. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. The methodology used to extract the networks and the s-expressions readily applies to other libraries, and is applicable to other proof assistants. With more than $250\,000$ entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TheoremGraph: Bridging Formal and Informal Mathematics
cs.IR 2026-06 unverdicted novelty 7.0

TheoremGraph builds a unified statement-level dependency graph across informal arXiv math and formal Lean code via parsing, embeddings, and LLM validation, releasing the data and APIs for search and retrieval.
Artificial Intelligence and the Structure of Mathematics
cs.AI 2026-04 unverdicted novelty 4.0

AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.