pith · machine review for the scientific record

arXiv:2605.14549 · v1 · submitted 2026-05-14 · 💻 cs.LO

Recognition: no theorem link

CSLibPremiseBench: Structure-Guided Premise Retrieval and Label Robustness for Lean 4 Computer-Science Theorems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:27 UTC · model grok-4.3

classification: 💻 cs.LO
keywords: premise retrieval · Lean 4 · CSLib · benchmark · formal verification · information retrieval · theorem proving

The pith

CSLibPremiseBench shows structure-guided reranking gives only modest early-rank gains over lexical baselines in Lean 4 computer-science premise retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CSLibPremiseBench, a reproducible benchmark pinned to CSLib v4.29.0 that evaluates source-visible premise retrieval across 801 tasks and 1875 candidate declarations using proxy labels drawn from proof references. It tests lexical methods such as BM25, symbol and name overlap, import-graph heuristics, module priors, and a structure-guided graph-lexical reranker called CSG-Rerank under strict import and source-order policies. Results indicate that CSG-Rerank improves mean reciprocal rank at early positions relative to plain BM25, yet it does not reliably exceed BM25 combined with symbol features and leaves Recall@10 unchanged. The study treats the benchmark as an audit tool that exposes how repository structure and label choices affect measured performance while explicitly avoiding claims about downstream proof generation.
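
For orientation, the two headline metrics can be written in a few lines. The sketch below computes mean reciprocal rank (with an optional early-rank cutoff) and Recall@10 over ranked candidate lists; the toy declaration names are hypothetical stand-ins, not tasks from the benchmark.

```python
def reciprocal_rank(ranked, gold, cutoff=None):
    """Reciprocal rank of the first proxy-gold premise (0.0 if none appears)."""
    for i, name in enumerate(ranked[:cutoff], start=1):
        if name in gold:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked, gold, k=10):
    """Fraction of proxy-gold premises recovered among the top-k candidates."""
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0

# Hypothetical tasks: (a method's ranking, its proxy-gold premise set).
tasks = [
    (["List.length_append", "Nat.add_comm", "List.map_map"], {"Nat.add_comm"}),
    (["Finset.sum_comm", "List.length_append"], {"List.length_append", "Nat.succ_le_succ"}),
]
mrr = sum(reciprocal_rank(r, g) for r, g in tasks) / len(tasks)
r10 = sum(recall_at_k(r, g) for r, g in tasks) / len(tasks)
print(f"MRR={mrr:.3f}  Recall@10={r10:.3f}")  # MRR=0.500  Recall@10=0.750
```

An "early-rank MRR gain" in the paper's sense is a gain in this quantity at a small cutoff, which can move even when Recall@10 does not.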

Core claim

CSLibPremiseBench is a pinned, source-visible benchmark for premise retrieval over Lean 4 CSLib declarations that compares lexical, symbolic, graph, and hybrid retrievers. Under strict policies, the structure-guided CSG-Rerank reranker produces a modest early-rank MRR lift over BM25 but fails to outperform BM25 plus symbol overlap or to raise Recall@10, and context-packet audits show stronger module concentration without corresponding top-k coverage or token-utility gains.
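
The review only names CSG-Rerank as a structure-guided graph-lexical reranker, so its construction cannot be recovered from this page. As one possible reading of the idea, the sketch below blends a precomputed lexical score with import-graph proximity to the goal's module; the Candidate shape, the proximity function, and the mixing weight alpha are assumptions for illustration, not the paper's algorithm.

```python
from dataclasses import dataclass

import networkx as nx  # assumed dependency, used only for shortest paths

@dataclass
class Candidate:
    name: str    # fully qualified declaration name
    module: str  # module the declaration lives in

def rerank(candidates, lexical_score, import_graph, goal_module, alpha=0.7):
    """Order candidates by a blend of lexical relevance and import-graph
    proximity to the goal's module; alpha is an illustrative mixing weight."""
    def proximity(module):
        try:
            d = nx.shortest_path_length(import_graph, goal_module, module)
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            return 0.0
        return 1.0 / (1.0 + d)  # declarations in nearby modules score higher
    blended = [
        (alpha * lexical_score[c.name] + (1 - alpha) * proximity(c.module), c)
        for c in candidates
    ]
    return [c for _, c in sorted(blended, key=lambda p: p[0], reverse=True)]
```

If symbol overlap already correlates with module proximity, a blend like this has little headroom over BM25 plus symbol features, which would be consistent with the reported null result.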

What carries the argument

CSLibPremiseBench, a fixed Lean 4 benchmark of 801 proxy-labelable tasks over 1875 CSLib declarations that uses source-visible proof-reference proxies and strict import/source-order policies to compare lexical, symbolic, graph-heuristic, and hybrid retrievers including CSG-Rerank.
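
Because the labels are read off the source text rather than from elaborated dependency traces, a rough approximation of the extraction is a scan of a proof block for identifier tokens that match candidate declaration names. The tokenizer and matching rule below are deliberately loose illustrations; the benchmark's actual matching policy is stricter.

```python
import re

def proxy_labels(proof_source: str, candidate_names: set[str]) -> set[str]:
    """Approximate source-visible proof references: any candidate whose
    qualified or unqualified name occurs as an identifier token in the
    proof text. A loose sketch, not the benchmark's policy."""
    tokens = set(re.findall(r"[A-Za-z_][\w.']*", proof_source))
    labels = set()
    for name in candidate_names:
        short = name.rsplit(".", 1)[-1]
        if name in tokens or short in tokens:
            labels.add(name)
    return labels

# Hypothetical proof text and candidate pool.
proof = "by simpa [List.length_append] using Nat.add_comm a b"
pool = {"List.length_append", "Nat.add_comm", "Finset.sum_comm"}
print(sorted(proxy_labels(proof, pool)))  # ['List.length_append', 'Nat.add_comm']
```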

If this is right

  • Repository structure and candidate-policy design materially shape measured premise-retrieval performance on CSLib.
  • Proxy labels drawn from proof references must carry explicit caveats when used for evaluation.
  • Benchmark results on this task set do not demonstrate or claim gains for automated proof generation or repair.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Broad mathematical theorem-proving benchmarks may systematically misrepresent retrieval behavior inside domain-specific libraries such as CSLib.
  • Symbol-overlap signals may already encode much of the structural information that more elaborate graph rerankers attempt to exploit (see the sketch after this list).
  • Replacing proxy labels with actual dependency traces extracted from completed proofs would provide a direct test of label robustness.
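
To make the second point above concrete: a symbol-overlap baseline can be as small as a Jaccard similarity between identifier sets. The tokenizer, stopword list, and toy statements below are our own assumptions, not the paper's implementation.

```python
import re

STOP = {"theorem", "lemma"}  # assumed keyword filter, not from the paper

def symbols(text: str) -> set[str]:
    """Identifier-like tokens in a Lean statement (illustrative tokenizer)."""
    return set(re.findall(r"[A-Za-z_][\w.']*", text)) - STOP

def symbol_overlap(goal: str, candidate: str) -> float:
    """Jaccard similarity between goal and candidate identifier sets."""
    g, c = symbols(goal), symbols(candidate)
    return len(g & c) / len(g | c) if g | c else 0.0

goal = "theorem my_len (l1 l2 : List a) : (l1 ++ l2).length = l1.length + l2.length"
cand = "theorem List.length_append : (as ++ bs).length = as.length + bs.length"
print(f"{symbol_overlap(goal, cand):.2f}")  # 0.08
```

Because the tokens already carry namespace and structural cues (e.g. `List.` prefixes), such overlap scores can proxy for much of what an import graph encodes.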

Load-bearing premise

Source-visible proxy labels taken from proof references accurately stand in for the premises that would actually be required in realistic proof-generation or proof-repair settings.

What would settle it

Directly inspecting whether the top-ranked premises from each method appear as actual dependencies inside manually completed or machine-generated proofs, for a random sample of the 801 tasks, would show whether the proxy labels match real usage.
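
A hedged sketch of that audit, assuming per-task access to a method's ranking, the proxy labels, and the set of declarations actually used by a completed proof; the task-dictionary shape here is hypothetical.

```python
import random

def audit(tasks, k=10, sample_size=50, seed=0):
    """Over a random sample of tasks, measure how often a method's top-k
    premises intersect a completed proof's actual dependencies, and how
    often the proxy labels agree with those dependencies."""
    sample = random.Random(seed).sample(tasks, min(sample_size, len(tasks)))
    topk_hits = sum(bool(set(t["ranked"][:k]) & t["true_deps"]) for t in sample)
    label_hits = sum(bool(t["proxy_labels"] & t["true_deps"]) for t in sample)
    n = len(sample)
    return {"topk_hits_deps": topk_hits / n, "proxy_matches_deps": label_hits / n}
```

If the two rates diverge sharply, the proxy labels are doing real work in the reported rankings; if they agree, the cheaper labels suffice.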

read the original abstract

CSLib is an emerging Lean 4 library for computer-science formalization, but its premise-retrieval behavior is not well represented by broad mathematical theorem-proving benchmarks. We introduce CSLibPremiseBench, a reproducible CSLib-specific benchmark and empirical study for source-visible premise retrieval over Lean 4 theorem and lemma declarations. The benchmark pins CSLib v4.29.0 at commit 0d37cc7fcc985cfc53b155e7eef2453f846c6da2, builds with Lean 4.29.0, and evaluates a strict import/source-order task set with 801 proxy-labelable tasks and 1875 CSLib candidate declarations. The labels are source-visible CSLib proof-reference proxies, not elaborated Lean dependency traces. We audit label robustness using stricter source-visible matching and a 300-task Lean environment expression probe, then compare BM25, symbol/name overlap, namespace/module and import-graph heuristics, PageRank-style module priors, fixed hybrids, and CSG-Rerank, a structure-guided graph-lexical reranker. CSG-Rerank gives a modest early-rank MRR gain over lexical BM25 under the strict policy, but does not reliably outperform BM25+symbol and does not improve Recall@10. A context-packet audit similarly finds stronger module/family concentration without reliable top-k proxy-gold coverage or token-utility gains. We position CSLibPremiseBench as a benchmark and audit paper: repository structure and candidate-policy design materially shape CSLib premise retrieval, proxy labels require explicit caveats, and proof-generation or proof-repair performance is not claimed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces CSLibPremiseBench, a reproducible benchmark for source-visible premise retrieval over Lean 4 theorem declarations in the emerging CSLib library. It pins CSLib v4.29.0 at commit 0d37cc7fcc985cfc53b155e7eef2453f846c6da2, defines a strict import/source-order task set with 801 proxy-labelable tasks and 1875 candidates using source-visible proof-reference proxies, audits label robustness via stricter matching and a 300-task expression probe, and compares BM25, symbol/name overlap, namespace/import-graph heuristics, PageRank priors, fixed hybrids, and the new CSG-Rerank structure-guided graph-lexical reranker. The central empirical finding is that CSG-Rerank yields a modest early-rank MRR gain over lexical BM25 under the strict policy but does not reliably outperform BM25+symbol and does not improve Recall@10; a context-packet audit shows stronger module/family concentration without reliable top-k coverage or token-utility gains. The work is explicitly positioned as a benchmark-and-audit study with disclaimers that no proof-generation performance is claimed.

Significance. If the reported results hold, the paper supplies a much-needed, narrowly scoped and reproducible benchmark for premise retrieval in computer-science formalizations, a domain poorly served by broad mathematical theorem-proving suites. The explicit pinning of commit and Lean version, the label-robustness audits, the policy-sensitivity analysis, and the clear disclaimers about proxy labels constitute genuine strengths that increase the reliability and utility of the empirical findings for the interactive theorem-proving community.

minor comments (2)
  1. [Abstract, §4] The precise construction of the graph used by CSG-Rerank (nodes, edge types, weighting) is only sketched; a one-sentence definition or a pointer to the exact algorithm would improve reproducibility.
  2. [§5.3] The 300-task Lean environment probe is described only at a high level; reporting the exact distribution of expression sizes, or the fraction of tasks where proxy labels matched elaborated dependencies, would strengthen the audit.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading, their accurate summary of the benchmark design and empirical findings, and their recommendation of acceptance. We are pleased that the reproducibility measures, label-robustness audits, and explicit disclaimers were viewed as strengths.

read point-by-point responses
  1. Referee: No major comments provided.

    Authors: We have no specific points to address from the major comments section. The referee's positive evaluation aligns with our positioning of the work as a narrowly scoped benchmark-and-audit study, and we do not plan any revisions on this basis. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical benchmark study that defines CSLibPremiseBench, pins a specific CSLib commit, constructs proxy labels from source-visible proof references, and reports retrieval metrics (MRR, Recall@10) for BM25, symbol overlap, import-graph heuristics, and CSG-Rerank. No equations, fitted parameters, or predictions are present; all results are direct experimental outputs under explicitly scoped policies with label-robustness audits. No self-citations serve as load-bearing premises, and no derivation reduces to its own inputs by construction. The work is therefore self-contained rather than dependent on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no free parameters, mathematical axioms, or invented entities; it relies on standard information-retrieval techniques and the existing Lean 4 and CSLib infrastructure.

pith-pipeline@v0.9.0 · 5599 in / 1160 out tokens · 52831 ms · 2026-05-15T01:27:08.827243+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. Justin Asher. LeanExplore: A search engine for Lean 4 declarations, 2025.
  2. Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics, 2023.
  3. Clark Barrett, Swarat Chaudhuri, Fabrizio Montesi, Jim Grundy, Pushmeet Kohli, Leonardo de Moura, Alexandre Rademaker, and Sorrachai Yingchareonthawornchai. CSLib: The Lean computer science library, 2026.
  4. Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language, 2021.
  5. Christopher Henson and Fabrizio Montesi. Computer science as infrastructure: the spine of the Lean computer science library (CSLib), 2026.
  6. Ryan Hsiang, William Adkisson, Robert Joseph George, and Anima Anandkumar. LeanDojo-v2: A comprehensive library for AI-assisted theorem proving in Lean. OpenReview, 2025. NeurIPS Mathematical Reasoning and AI 2025 Poster.
  7. Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, and Lenny Taelman. SorryDB: Can AI provers complete real-world Lean theorems?, 2026.
  8. Yicheng Qian, Joshua Clune, Clark Barrett, and Jeremy Avigad. Lean-auto: An interface between Lean 4 and automated theorem provers, 2025.
  9. Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Lean Copilot: Large language models as copilots for theorem proving in Lean, 2024. NeuS 2025.
  10. The mathlib Community. The Lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, 2020.
  11. Evan Wang, Simon Chess, Daniel Lee, Siyuan Ge, Ajit Mallavarapu, Jarod Alper, and Vasily Ilin. Learning to repair Lean proofs from compiler feedback, 2026. ICLR VerifAI Workshop 2026.
  12. Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxce, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Leo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga... Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learning, 2025.
  13. Yutong Xin, Qiaochu Chen, Greg Durrett, and Isil Dillig. VeriSoftBench: Repository-scale formal verification benchmarks for Lean, 2026.
  14. Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models, 2023. NeurIPS 2023 Datasets and Benchmarks Track.
  15. Xiao-Wen Yang, Zihao Zhang, Jianuo Cao, Zhi Zhou, Zenan Li, Lan-Zhe Guo, Yuan Yao, Taolue Chen, Yu-Feng Li, and Xiaoxing Ma. FormalML: A benchmark for evaluating formal subgoal completion in machine learning theory, 2025.
  16. Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. MiniF2F: A cross-system benchmark for formal olympiad-level mathematics, 2021. ICLR 2022.