pith. sign in

arxiv: 2606.25363 · v1 · pith:REPBKV37new · submitted 2026-06-24 · 💻 cs.IR · cs.AI· math.HO

TheoremGraph: Bridging Formal and Informal Mathematics

Pith reviewed 2026-06-25 20:44 UTC · model grok-4.3

classification 💻 cs.IR cs.AImath.HO
keywords TheoremGraphdependency graphinformal mathematicsformal mathematicsLeansemantic matchingmathematical retrievalarXiv parsing
0
0 comments X

The pith

TheoremGraph builds one statement-level dependency graph that spans informal arXiv papers and formal Lean libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that mathematical statements and their dependencies can be organized into a single graph that covers both human-written papers and machine-checked code. It extracts 18.3 million candidate directed dependencies from 11.7 million theorem-like environments on arXiv and 11.3 million typed edges from 388,105 Lean declarations, then uses slogan embeddings to identify 47,952 reliable cross-matches. A sympathetic reader would care because the resulting structure supports search and reasoning that can move fluidly between informal intuition and formal verification.

Core claim

TheoremGraph is a unified statement-level dependency graph spanning both informal and formal mathematics, constructed by parsing arXiv to recover 18.3M candidate directed dependencies labeled by extractor, extracting 388,105 nodes and 11.3M edges via a Lean 4 elaborator, and bridging the domains by embedding generated natural-language slogans to produce 47,952 LLM-affirmed matches above a 0.8 cosine threshold with acceptance rising to 87 percent at 0.9.

What carries the argument

TheoremGraph, the unified statement-level dependency graph that links informal extractions and formal Lean graphs through slogan embeddings in a shared semantic space.

Load-bearing premise

The candidate directed dependencies from the informal extractors and the LLM-accepted cross-matches are accurate enough to support downstream search and reasoning tasks without substantial noise or systematic error.

What would settle it

A manual audit of 100 randomly sampled matches from the 0.8-0.9 cosine tier that finds an acceptance rate below 70 percent would show the bridging links are too noisy for reliable use.

Figures

Figures reproduced from arXiv: 2606.25363 by Artemii Remizov, Eric Leonen, Evan Wang, Giovanni Inchiostro, Jarod Alper, Luke Alexander, Simon Kurgan, Sophie Szeto, Vasily Ilin.

Figure 1
Figure 1. Figure 1: The matching pipeline. Lean declarations and LA [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Verdict composition by cosine-similarity bin (candidates at [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: What each system embeds for a declaration [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Graph expansion on Lattice (Config E). Af￾ter retrieval we follow the dependency graph one hop up and add each candidate’s formal_dependency parents. Their names (SemilatticeInf, SemilatticeSup, Lattice.toSemilatticeInf) match name-like queries the bare slogan misses. LSv2 builds the same graph but uses it only for informalization, not at retrieval. Results The full ablation is in [PITH_FULL_IMAGE:figures… view at source ↗
Figure 5
Figure 5. Figure 5: MathlibQR fair-810, Recall@10 by declaration kind, comparing the baseline (Config A) with the full lever [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Three kinds of match the slogan embedding can surface, by the corpora of the query and its nearest [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Recall at k=10. Cross-paper graph expansion roughly doubles the trained policy’s recall. Citation-graph expansion recovers dependencies that embedding search misses. Cross-paper graph expansion raises recall from 7.9% to 13.4% ( [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Recall at k=10 by number of true dependencies. Cross-paper graph expansion is most effective for low-degree targets. Limitations. Differences within the 6–8% validation band should be treated cautiously because the trained-policy validation set has only n=150. However, cross-paper graph expansion adds another 5.5 points, which appears to be the more significant statistic. The graph result also uses a large… view at source ↗
Figure 10
Figure 10. Figure 10: Cosine similarity and node depth for informal and formal graphs. [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗
read the original abstract

Mathematical knowledge is organized around statements and their dependencies, but this structure is exposed unevenly: informal papers cite mostly at the document level, while formal libraries record fine-grained dependencies over a much smaller body of mathematics. We introduce TheoremGraph, a unified statement-level dependency graph spanning both informal and formal mathematics. On the informal side, we parse 11.7M theorem-like environments from mathematics arXiv and recover 18.3M candidate directed dependencies, each labeled by the extractor that proposed it so downstream users can trade coverage for precision. On the formal side, we release LeanGraph, a Lean 4 elaborator-level extractor producing 388,105 declaration nodes and 11.3M typed edges across 25 Lean projects. We bridge the two graphs by embedding generated natural-language slogans into a shared semantic space, linking related statements across papers and across the informal/formal divide; an LLM judge affirms 47,952 such matches above a 0.8 cosine floor, with the judge-acceptance rate rising from 48% across the floor to 87% in the >=0.9 tier. On formal concept retrieval, our name-and-signature representation with graph expansion comes within 0.5pp of LeanSearch v2's reranked Recall@10 (0.775 vs. 0.780) without an LM reranker. We release the dataset, extractors, HTTP API, and MCP interface as infrastructure for mathematical search, attribution, and retrieval-augmented reasoning, available at theoremsearch.com and huggingface.co/datasets/uw-math-ai/theorem-matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces TheoremGraph, a unified statement-level dependency graph spanning informal mathematics (parsed from 11.7M theorem-like environments in arXiv yielding 18.3M candidate directed dependencies) and formal mathematics (via the released LeanGraph extractor producing 388,105 nodes and 11.3M typed edges across 25 Lean projects). Informal-formal bridging is performed by embedding generated slogans into a shared space and applying an LLM judge to affirm 47,952 matches above a 0.8 cosine threshold (with acceptance rising from 48% to 87% in the >=0.9 tier). A retrieval evaluation shows the name-and-signature representation with graph expansion achieving 0.775 Recall@10 on formal concept retrieval, within 0.5pp of LeanSearch v2 without an LM reranker. The dataset, extractors, HTTP API, and MCP interface are released at theoremsearch.com and Hugging Face.

Significance. If the extracted dependencies and cross-matches prove sufficiently accurate, TheoremGraph would supply useful shared infrastructure for mathematical search, attribution, and retrieval-augmented reasoning. The explicit labeling of candidate dependencies by extractor (allowing users to trade coverage for precision), the release of the Lean 4 elaborator-level extractor, the public API, and the reproducible retrieval baseline are concrete strengths supporting downstream use and verification.

major comments (1)
  1. [Abstract] Abstract: The central claim that TheoremGraph supplies a usable unified dependency graph for search and RAG tasks rests on the premise that the 18.3M informal candidates and 47,952 LLM-accepted matches contain acceptably low noise. However, the manuscript reports only the tiered judge-acceptance rate and a single downstream Recall@10 figure; it provides no precision, recall, or human-validated accuracy metrics for either the informal dependency extractors or the slogan-embedding matches against any gold-standard set. This validation gap directly affects whether the released graph can be treated as reliable infrastructure.
minor comments (1)
  1. The retrieval comparison (0.775 vs. 0.780 Recall@10) is presented without error bars, variance estimates, or statistical significance testing, which would clarify whether the 0.5pp gap is meaningful.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for emphasizing the need for stronger validation evidence. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that TheoremGraph supplies a usable unified dependency graph for search and RAG tasks rests on the premise that the 18.3M informal candidates and 47,952 LLM-accepted matches contain acceptably low noise. However, the manuscript reports only the tiered judge-acceptance rate and a single downstream Recall@10 figure; it provides no precision, recall, or human-validated accuracy metrics for either the informal dependency extractors or the slogan-embedding matches against any gold-standard set. This validation gap directly affects whether the released graph can be treated as reliable infrastructure.

    Authors: We acknowledge that the manuscript does not report direct precision or recall figures against gold-standard annotations for the dependency extractors or the LLM-validated matches. The design instead labels every candidate dependency by its source extractor, explicitly enabling users to select higher-precision subsets, and reports tiered LLM-judge acceptance rates (48% overall, rising to 87% at cosine ≥0.9) together with competitive downstream retrieval performance. These choices reflect the scale of the data (18.3 M candidates) and the absence of existing gold standards at this breadth. We agree that human-validated accuracy metrics would strengthen the infrastructure claim and will add a targeted human evaluation on sampled dependencies and matches in the revision. revision: yes

Circularity Check

0 steps flagged

Data-construction paper; no derivations or self-referential predictions

full rationale

The manuscript describes a pipeline that parses arXiv papers, extracts candidate dependencies, embeds slogans, applies an LLM judge to accept matches above a cosine threshold, and releases the resulting graph plus LeanGraph. All reported quantities (18.3M candidates, 47,952 matches, acceptance rates 48%→87%, Recall@10 0.775) are direct outputs of this construction process rather than predictions derived from fitted parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the inputs by definition. The central claim is the release of the dataset and tools; it does not rest on any load-bearing self-citation chain or renaming of known results. This is the normal, non-circular outcome for an infrastructure paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied data-extraction and infrastructure paper with no mathematical axioms, free parameters, or invented entities required by a derivation.

pith-pipeline@v0.9.1-grok · 5847 in / 1202 out tokens · 37146 ms · 2026-06-25T20:44:12.344607+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 8 canonical work pages

  1. [1]

    Akiko Aizawa, Michael Kohlhase, Iadh Ounis, and Moritz Schubotz. 2014. Ntcir-11 math-2 task overview

  2. [2]

    Jesse Alama, Lionel Mamane, and Josef Urban. 2012. https://arxiv.org/abs/1109.3687 Dependencies in formal mathematics: Applications and extraction for Coq and Mizar . Preprint, arXiv:1109.3687

  3. [3]

    Luke Alexander, Eric Leonen, Sophie Szeto, Artemii Remizov, Ignacio Tejeda, Jarod Alper, Giovanni Inchiostro, and Vasily Ilin. 2026. https://arxiv.org/abs/2602.05216 Semantic search over 9 million mathematical theorems . arXiv preprint

  4. [4]

    Artifex Software, Inc. 2026. https://pymupdf.readthedocs.io/en/latest/ PyMuPDF Documentation . Accessed: 2026-06-22

  5. [5]

    Michael Artin. 2011. Algebra, 2 edition. Pearson

  6. [6]

    Andrej Bauer, Matej Petković, and Ljupčo Todorovski. 2023. https://arxiv.org/abs/2310.16005 Mlfmf: Data sets for machine learning for mathematical formalization . Preprint, arXiv:2310.16005

  7. [7]

    Carl Bergstrom. 2007. https://doi.org/10.5860/crln.68.5.7804 Eigenfactor: Measuring the value and prestige of scholarly journals . College and Research Libraries News, 68(5):314--316

  8. [8]

    Rong Bian, Yu Geng, Zijian Yang, and Bing Cheng. 2025. https://arxiv.org/abs/2505.13406 Automathkg: The automated mathematical knowledge graph based on llm and vector database . Preprint, arXiv:2505.13406

  9. [9]

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. 2023. https://arxiv.org/abs/2308.13418 Nougat: Neural optical understanding for academic documents . Preprint, arXiv:2308.13418

  10. [10]

    Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, and 3 others. 2025. https://arxiv.org/abs/2512.17260 Seed-prover 1.5: Mastering undergraduate-level theorem provi...

  11. [11]

    frenzymath . 2024. jixia: A static analysis tool for lean 4. https://github.com/frenzymath/jixia. GitHub repository, accessed 2026-05-23

  12. [12]

    Guoxiong Gao, Haocheng Ju, Jiedong Jiang, Zihan Qin, and Bin Dong. 2025. https://arxiv.org/abs/2403.13310 A semantic search engine for mathlib4 . Preprint, arXiv:2403.13310

  13. [13]

    Guoxiong Gao, Zeming Sun, Jiedong Jiang, Yutong Wang, Jingda Xu, Peihao Wu, Bryan Dai, and Bin Dong. 2026. https://arxiv.org/abs/2605.13137 Leansearch v2: Global premise retrieval for lean 4 theorem proving . Preprint, arXiv:2605.13137

  14. [14]

    Eugene Garfield. 1972. https://doi.org/10.1126/science.178.4060.471 Citation analysis as a tool in journal evaluation . Science, 178(4060):471--479

  15. [15]

    Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, and Yisen Wang. 2025. https://arxiv.org/abs/2505.18499 G1: Teaching llms to reason on graphs with reinforcement learning . Preprint, arXiv:2505.18499

  16. [16]

    J. E. Hirsch. 2005. https://doi.org/10.1073/pnas.0507655102 An index to quantify an individual's scientific research output . Proceedings of the National Academy of Sciences, 102(46):16569--16572

  17. [17]

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. 2026. https://arxiv.org/abs/2601.20802 Reinforcement learning via self-distillation . Preprint, arXiv:2601.20802

  18. [18]

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. 2020. https://arxiv.org/abs/2004.04906 Dense passage retrieval for open-domain question answering . Preprint, arXiv:2004.04906

  19. [19]

    Lean FRO . 2026. https://github.com/leanprover/lean4export lean4export : Plain-text declaration export for lean 4 . GitHub repository. Accessed: 2026-05-25

  20. [20]

    Lean Prover Community . 2026. https://github.com/leanprover/doc-gen4 doc-gen4 : Document generator for lean 4 . Accessed May 25, 2026

  21. [21]

    Xinze Li, Nanyun Peng, Simone Severini, and Patrick Shafto. 2026. https://arxiv.org/abs/2604.24797 The network structure of mathlib . Preprint, arXiv:2604.24797

  22. [22]

    Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Hui Liu, Xiang Zhang, and Suhang Wang. 2025. https://arxiv.org/abs/2510.16724 A comprehensive survey on reinforcement learning-based agentic search: Foundations, roles, optimizations, evaluations, and applications . Preprint, arXiv:2510.16724

  23. [23]

    Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li. 2026. https://arxiv.org/abs/2601.14027 Numina-lean-agent: An open and general agentic reasoning system for formal mathematics . Preprint, arXiv:2601.14027

  24. [24]

    Yu. A. Malkov and D. A. Yashunin. 2018. https://arxiv.org/abs/1603.09320 Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs . Preprint, arXiv:1603.09320

  25. [25]

    Oard, and Richard Zanibbi

    Behrooz Mansouri, V\' t Novotn\' y , Anurag Agarwal, Douglas W. Oard, and Richard Zanibbi. 2022. https://doi.org/10.1007/978-3-031-13643-6_20 Overview of arqmath-3 (2022): Third clef lab on answer retrieval for questions on math . In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 13th International Conference of the CLEF Associatio...

  26. [26]

    Patrick Massot. 2020. leanblueprint: A plastex plugin for writing blueprints of lean formalization projects. https://github.com/PatrickMassot/leanblueprint. GitHub repository; originally developed for the Sphere Eversion Project. Accessed 2026-05-26

  27. [27]

    Mathpix . 2026. Mathpix Markdown : Markdown for math and science. https://mathpix.com/mathpix-markdown. Accessed: 2026-06-22

  28. [28]

    Shrey Mishra, Yacine Brihmouche, Th\' e o Delemazure, Antoine Gauquier, and Pierre Senellart. 2024. https://aclanthology.org/2024.sdp-1.16 First steps in building a knowledge base of mathematical results . In Proceedings of the Fourth Workshop on Scholarly Document Processing

  29. [29]

    Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh. 2025. https://arxiv.org/abs/2511.03108 minif2f-lean revisited: Reviewing limitations and charting a path forward . Preprint, arXiv:2511.03108

  30. [30]

    Gabriel Pinski and Francis Narin. 1976. https://doi.org/10.1016/0306-4573(76)90048-0 Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics . Information Processing and Management, 12(5):297--312

  31. [31]

    Stanislas Polu and Ilya Sutskever. 2020. https://arxiv.org/abs/2009.03393 Generative language modeling for automated theorem proving . Preprint, arXiv:2009.03393

  32. [32]

    Delip Rao, Jonathan Young, Thomas Dietterich, and Chris Callison-Burch. 2024. https://arxiv.org/abs/2412.03775 Withdrarxiv: A large-scale dataset for retraction study . Preprint, arXiv:2412.03775

  33. [33]

    Borja Requena, Austin Letson, Krystian Nowakowski, Izan Beltran-Ferreiro, and Leopoldo Sarra. 2026. https://arxiv.org/abs/2602.24273 A minimal agent for automated theorem proving . Preprint, arXiv:2602.24273

  34. [34]

    Christian Steinfeldt and Helena Mihaljevi\' c . 2024. https://doi.org/10.1007/978-3-031-66997-2_14 Evaluation and domain adaptation of similarity models for short mathematical texts . In Intelligent Computer Mathematics: 17th International Conference, CICM 2024, Montr\' e al, QC, Canada, August 5–9, 2024, Proceedings , page 241–260, Berlin, Heidelberg. Sp...

  35. [35]

    Qwen Team. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

  36. [36]

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. 2024. https://arxiv.org/abs/2407.11214 Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition . Preprint, arXiv:2407.11214

  37. [37]

    Dylan Walker, Huafeng Xie, Koon-Kiu Yan, and Sergei Maslov. 2007. https://doi.org/10.1088/1742-5468/2007/06/p06010 Ranking scientific publications using a model of network traffic . Journal of Statistical Mechanics: Theory and Experiment, 2007(06):P06010–P06010

  38. [38]

    Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. 2021. https://arxiv.org/abs/2104.01112 Naturalproofs: Mathematical theorem proving in natural language . Preprint, arXiv:2104.01112

  39. [39]

    Banri Yanahama and Akiyoshi Sannai. 2026. https://arxiv.org/abs/2604.16347 Lean atlas: An integrated proof environment for scalable human-ai collaborative formalization . Preprint, arXiv:2604.16347

  40. [40]

    Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar

    Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. 2023. https://arxiv.org/abs/2306.15626 Leandojo: Theorem proving with retrieval-augmented language models . Preprint, arXiv:2306.15626

  41. [41]

    Richard Zanibbi, Akiko Aizawa, Michael Kohlhase, Iadh Ounis, Goran Topic, and Kenny Davila. 2016. https://api.semanticscholar.org/CorpusID:9102694 Ntcir-12 mathir task overview . In NTCIR Conference on Evaluation of Information Access Technologies

  42. [42]

    Richard Zanibbi, Kenny Davila, Andrew Kane, and Frank Tompa. 2015. https://arxiv.org/abs/1507.06235 The tangent search engine: Improved similarity metrics and scalability for math formula search . Preprint, arXiv:1507.06235

  43. [43]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025 a . https://arxiv.org/abs/2506.05176 Qwen3 embedding: Advancing text embedding and reranking through foundation models . Preprint, arXiv:2506.05176

  44. [44]

    Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, and Yulia Tsvetkov. 2025 b . https://arxiv.org/abs/2506.00845 Generalizable llm learning of graph synthetic data with post-training alignment . Preprint, arXiv:2506.00845

  45. [45]

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. 2022. https://arxiv.org/abs/2109.00110 Minif2f: a cross-system benchmark for formal olympiad-level mathematics . Preprint, arXiv:2109.00110

  46. [46]

    Karol Zyczkowski. 2009. https://doi.org/10.1007/s11192-010-0208-6 Citation graph, weighted impact factors and performance indices . Scientometrics, 85