Recognition: no theorem link
CSLibPremiseBench: Structure-Guided Premise Retrieval and Label Robustness for Lean 4 Computer-Science Theorems
Pith reviewed 2026-05-15 01:27 UTC · model grok-4.3
The pith
CSLibPremiseBench shows structure-guided reranking gives only modest early-rank gains over lexical baselines in Lean 4 computer-science premise retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CSLibPremiseBench is a pinned, source-visible benchmark for premise retrieval over Lean 4 CSLib declarations that compares lexical, symbolic, graph, and hybrid retrievers. Under strict policies, the structure-guided CSG-Rerank reranker produces a modest early-rank MRR lift over BM25 but fails to outperform BM25 plus symbol overlap and does not raise Recall@10, while context-packet audits show stronger module concentration without corresponding top-k coverage or token-utility gains.
What carries the argument
CSLibPremiseBench, a fixed Lean 4 benchmark of 801 proxy-labelable tasks over 1875 CSLib declarations that uses source-visible proof-reference proxies and strict import/source-order policies to compare lexical, symbolic, graph-heuristic, and hybrid retrievers including CSG-Rerank.
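The two headline metrics, MRR and Recall@10, can be sketched in a few lines (a minimal illustration; the function names and data layout are ours, not the benchmark's):

```python
def mrr(rankings):
    """Mean reciprocal rank: average over tasks of 1/rank of the first
    relevant candidate. Each entry of `rankings` is a per-task ranked list
    of booleans, True where the candidate is a proxy-gold premise."""
    total = 0.0
    for ranked in rankings:
        for i, is_gold in enumerate(ranked, start=1):
            if is_gold:
                total += 1.0 / i
                break
    return total / len(rankings)

def recall_at_k(rankings, golds_per_task, k=10):
    """Macro-averaged Recall@k: fraction of each task's proxy-gold premises
    recovered in the top k, averaged over tasks."""
    scores = []
    for ranked, n_gold in zip(rankings, golds_per_task):
        hits = sum(ranked[:k])
        scores.append(hits / n_gold if n_gold else 0.0)
    return sum(scores) / len(scores)
```

Under these definitions, a reranker can raise MRR (pulling the first gold premise earlier) without moving Recall@10 at all, which is exactly the pattern the paper reports for CSG-Rerank.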
If this is right
- Repository structure and candidate-policy design materially shape measured premise-retrieval performance on CSLib.
- Proxy labels drawn from proof references must carry explicit caveats when used for evaluation.
- Benchmark results on this task set do not demonstrate or claim gains for automated proof generation or repair.
Where Pith is reading between the lines
- Broad mathematical theorem-proving benchmarks may systematically misrepresent retrieval behavior inside domain-specific libraries such as CSLib.
- Symbol-overlap signals may already encode much of the structural information that more elaborate graph rerankers attempt to exploit.
- Replacing proxy labels with actual dependency traces extracted from completed proofs would provide a direct test of label robustness.
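The BM25-plus-symbol-overlap baseline that CSG-Rerank fails to beat plausibly amounts to a weighted combination of a lexical score with a Jaccard-style symbol overlap. A minimal self-contained sketch, with the weighting `alpha` and all names our assumptions rather than the paper's implementation:

```python
import math

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Toy BM25 over tokenized candidate declarations."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {}  # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    scores = []
    for d in docs:
        s = 0.0
        for t in query_tokens:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def symbol_overlap(query_symbols, doc_symbols):
    """Jaccard overlap between symbol sets (e.g. identifiers in the goal
    statement vs. identifiers in the candidate declaration)."""
    q, d = set(query_symbols), set(doc_symbols)
    return len(q & d) / len(q | d) if q | d else 0.0

def hybrid_rank(query_tokens, query_symbols, docs, doc_symbols, alpha=0.7):
    """Rank candidates by alpha * normalized-BM25 + (1 - alpha) * overlap."""
    bm25 = bm25_scores(query_tokens, docs)
    mx = max(bm25) or 1.0
    combined = [alpha * (s / mx) + (1 - alpha) * symbol_overlap(query_symbols, ds)
                for s, ds in zip(bm25, doc_symbols)]
    return sorted(range(len(docs)), key=lambda i: -combined[i])
```

If symbol identity already captures module locality (declarations in one module tend to share namespaced identifiers), this simple hybrid would absorb much of the signal a graph reranker targets, consistent with the paper's finding.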
Load-bearing premise
Source-visible proxy labels taken from proof references accurately stand in for the premises that would actually be required in realistic proof-generation or proof-repair settings.
What would settle it
Direct inspection of whether the top-ranked premises from each method appear as actual dependencies inside manually completed or machine-generated proofs for a random sample of the 801 tasks would show whether the proxy labels match real usage.
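Such an audit could be scripted roughly as follows; the inputs `top_ranked` and `true_deps` are hypothetical placeholders for retriever output and dependency traces extracted from elaborated proofs (producing the latter is the hard part, not this loop):

```python
import random

def audit_proxy_labels(tasks, top_ranked, true_deps, sample_size=50, seed=0):
    """For a random sample of tasks, measure how often any of the retriever's
    top-10 premises appears among the true dependencies of a completed proof.

    `top_ranked` maps task id -> ordered list of premise names;
    `true_deps` maps task id -> set of dependency names extracted from the
    elaborated proof term (hypothetical input)."""
    rng = random.Random(seed)
    sample = rng.sample(tasks, min(sample_size, len(tasks)))
    agree = 0
    for t in sample:
        top10 = set(top_ranked[t][:10])
        if top10 & true_deps[t]:
            agree += 1
    return agree / len(sample)
```

A large gap between this agreement rate and the proxy-label metrics would indicate that the source-visible labels are not a faithful stand-in for real proof usage.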
Original abstract
CSLib is an emerging Lean 4 library for computer-science formalization, but its premise-retrieval behavior is not well represented by broad mathematical theorem-proving benchmarks. We introduce CSLibPremiseBench, a reproducible CSLib-specific benchmark and empirical study for source-visible premise retrieval over Lean 4 theorem and lemma declarations. The benchmark pins CSLib v4.29.0 at commit 0d37cc7fcc985cfc53b155e7eef2453f846c6da2, builds with Lean 4.29.0, and evaluates a strict import/source-order task set with 801 proxy-labelable tasks and 1875 CSLib candidate declarations. The labels are source-visible CSLib proof-reference proxies, not elaborated Lean dependency traces. We audit label robustness using stricter source-visible matching and a 300-task Lean environment expression probe, then compare BM25, symbol/name overlap, namespace/module and import-graph heuristics, PageRank-style module priors, fixed hybrids, and CSG-Rerank, a structure-guided graph-lexical reranker. CSG-Rerank gives a modest early-rank MRR gain over lexical BM25 under the strict policy, but does not reliably outperform BM25+symbol and does not improve Recall@10. A context-packet audit similarly finds stronger module/family concentration without reliable top-k proxy-gold coverage or token-utility gains. We position CSLibPremiseBench as a benchmark and audit paper: repository structure and candidate-policy design materially shape CSLib premise retrieval, proxy labels require explicit caveats, and proof-generation or proof-repair performance is not claimed.
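The strict import/source-order candidate policy the abstract describes can be illustrated with a toy eligibility filter (a sketch under our assumptions; the field names are ours, and the real policy may differ in how it handles transitive imports and same-file ordering):

```python
def eligible_candidates(task, declarations, import_closure):
    """Under a strict policy, a candidate premise is eligible only if it is
    declared in a module the task's module (transitively) imports, or earlier
    in the same source file as the task's own declaration.

    `declarations` is a list of dicts with 'name', 'module', 'line';
    `import_closure` maps a module to the set of modules it transitively
    imports."""
    visible = import_closure.get(task["module"], set())
    out = []
    for d in declarations:
        if d["name"] == task["name"]:
            continue  # a declaration cannot be its own premise
        if d["module"] in visible:
            out.append(d)
        elif d["module"] == task["module"] and d["line"] < task["line"]:
            out.append(d)
    return out
```

Because this filter shrinks or grows the candidate pool per task, changing the policy directly changes every retrieval metric, which is why the paper treats candidate-policy design as a first-class experimental variable.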
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CSLibPremiseBench, a reproducible benchmark for source-visible premise retrieval over Lean 4 theorem declarations in the emerging CSLib library. It pins CSLib v4.29.0 at commit 0d37cc7fcc985cfc53b155e7eef2453f846c6da2, defines a strict import/source-order task set with 801 proxy-labelable tasks and 1875 candidates using source-visible proof-reference proxies, audits label robustness via stricter matching and a 300-task expression probe, and compares BM25, symbol/name overlap, namespace/import-graph heuristics, PageRank priors, fixed hybrids, and the new CSG-Rerank structure-guided graph-lexical reranker. The central empirical finding is that CSG-Rerank yields a modest early-rank MRR gain over lexical BM25 under the strict policy but does not reliably outperform BM25+symbol and does not improve Recall@10; a context-packet audit shows stronger module/family concentration without reliable top-k coverage or token-utility gains. The work is explicitly positioned as a benchmark-and-audit study with disclaimers that no proof-generation performance is claimed.
Significance. If the reported results hold, the paper supplies a much-needed, narrowly scoped and reproducible benchmark for premise retrieval in computer-science formalizations, a domain poorly served by broad mathematical theorem-proving suites. The explicit pinning of commit and Lean version, the label-robustness audits, the policy-sensitivity analysis, and the clear disclaimers about proxy labels constitute genuine strengths that increase the reliability and utility of the empirical findings for the interactive theorem-proving community.
Minor comments (2)
- [Abstract, §4] The precise construction of the graph used by CSG-Rerank (nodes, edge types, weighting) is only sketched; a one-sentence definition or a pointer to the exact algorithm would improve reproducibility.
- [§5.3] The 300-task Lean environment probe is described only at a high level; reporting the exact distribution of expression sizes, or the fraction of tasks where proxy labels matched elaborated dependencies, would strengthen the audit.
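For concreteness, one plausible shape for a structure-guided graph-lexical reranker of the kind the first comment asks to see defined: nodes are declarations, edges follow module/import structure, and a proximity prior is blended into the lexical score. This is our reconstruction, not the paper's algorithm, and the weights are illustrative:

```python
def csg_style_rerank(lexical_scores, candidates, task_module, import_graph, beta=0.3):
    """Rerank candidates by blending a normalized lexical score with a
    graph-proximity prior.

    `candidates` is a list of (name, module); `import_graph` maps a module to
    the set of modules it directly imports. Proximity is 1.0 for the task's
    own module, 0.5 for a directly imported module, 0.0 otherwise
    (illustrative weights)."""
    mx = max(lexical_scores) or 1.0

    def proximity(mod):
        if mod == task_module:
            return 1.0
        if mod in import_graph.get(task_module, set()):
            return 0.5
        return 0.0

    scored = [
        ((1 - beta) * (s / mx) + beta * proximity(mod), name)
        for s, (name, mod) in zip(lexical_scores, candidates)
    ]
    return [name for _, name in sorted(scored, reverse=True)]
```

A module-proximity bonus of this kind naturally produces the reported behavior: it concentrates the top of the ranking within the task's module family without surfacing gold premises that the lexical score missed entirely.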
Simulated Author's Rebuttal
We thank the referee for their careful reading, accurate summary of the benchmark design and empirical findings, and for recommending acceptance. We are pleased that the reproducibility measures, label-robustness audits, and explicit disclaimers were viewed as strengths.
Point-by-point responses
Referee: No major comments provided.
Authors: We have no specific points to address from the major comments section. The referee's positive evaluation aligns with our positioning of the work as a narrowly scoped benchmark-and-audit study, and we do not plan any revisions on this basis. Revision: no.
Circularity Check
No significant circularity
Full rationale
The paper is an empirical benchmark study that defines CSLibPremiseBench, pins a specific CSLib commit, constructs proxy labels from source-visible proof references, and reports retrieval metrics (MRR, Recall@10) for BM25, symbol overlap, import-graph heuristics, and CSG-Rerank. No equations, fitted parameters, or predictions are present; all results are direct experimental outputs under explicitly scoped policies with label-robustness audits. No self-citations serve as load-bearing premises, and no derivation reduces to its own inputs by construction; the reported results therefore do not rest on circular reasoning.