LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
Pith reviewed 2026-05-14 18:28 UTC · model grok-4.3
The pith
LeanSearch v2 recovers full premise groups for Lean 4 theorems and raises proof success from 4% to 20% in a controlled downstream evaluation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LeanSearch v2 is a two-mode retrieval system for global premise retrieval in Lean 4. Standard mode applies an embedding-reranker pipeline to a hierarchy-informalized Mathlib corpus and reaches state-of-the-art single-query retrieval. Reasoning mode builds iterative sketch-retrieve-reflect cycles on that substrate to recover the full premise sets required by theorems. On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within the top 10 candidates. In a controlled downstream evaluation with a fixed prover loop, replacing other retrievers with LeanSearch v2 yields 20% proof success compared with 16% for the next-best system.
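To make "global premise retrieval" concrete, here is an illustrative Lean 4 example of ours (not drawn from the paper): even a one-line proof can require jointly identifying several scattered Mathlib lemmas, which is the premise-group recovery task the system targets.

```lean
-- Illustrative only (not from the paper): a one-line proof that needs two
-- scattered Mathlib premises at once, `Nat.le_of_dvd` and
-- `Nat.pos_of_ne_zero`. Global premise retrieval aims to recover such full
-- premise sets for a whole theorem, rather than one lemma per query.
import Mathlib

example (a b : ℕ) (h : a ∣ b) (hb : b ≠ 0) : a ≤ b :=
  Nat.le_of_dvd (Nat.pos_of_ne_zero hb) h
```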
What carries the argument
Reasoning mode's iterative sketch-retrieve-reflect cycles built on an embedding-reranker pipeline over a hierarchy-informalized Mathlib corpus
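The cycle can be caricatured in a few lines of Python. Everything below is a toy stand-in of ours, not the paper's implementation: a lexical word-overlap retriever in place of the embedding-reranker pipeline, and a "reflect" step that simply folds retrieved premise text back into the query.

```python
def retrieve(query, corpus, k=3):
    """Toy lexical retriever: rank corpus entries by word overlap with
    the query (a stand-in for the embedding-reranker pipeline)."""
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def sketch_retrieve_reflect(goal, corpus, rounds=3, k=3):
    """Accumulate premises over several rounds; each round 'reflects' by
    extending the query with terms from the premises found so far."""
    query, found = goal, []
    for _ in range(rounds):
        for doc in retrieve(query, corpus, k):
            if doc not in found:
                found.append(doc)
        # Reflect step (toy heuristic): fold retrieved text into the query
        # so the next round can surface premises related to earlier hits.
        query = goal + " " + " ".join(found)
    return found
```

The point of the reflect step, as described, is that premises found early reshape the query for later rounds, which plain repeated retrieval would not do.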
If this is right
- Higher-quality global premise retrieval directly increases automated proof success rates inside a fixed Lean 4 prover loop.
- The base retrieval pipeline works without domain-specific fine-tuning yet still outperforms prior systems on single-query and group-recovery tasks.
- Measured recovery of ground-truth premise groups correlates with actual downstream proof-generation performance.
- Open-sourced code, data, and benchmarks make the system reproducible and extensible for further Lean 4 work.
Where Pith is reading between the lines
- Extending the iterative cycles to handle larger or more interdependent premise sets could shrink the search space in interactive theorem proving for advanced mathematics.
- Pairing the retriever with language-model tactic generators may produce stronger combined proof automation pipelines.
- The global-retrieval approach could transfer to other formal libraries and interactive theorem provers that rely on large lemma collections.
Load-bearing premise
The 69-query benchmark and the fixed prover loop are representative of typical research-level Lean 4 usage.
What would settle it
Re-running the evaluation on a larger set of theorems drawn from recent Mathlib contributions or with a different prover configuration that relies less on premise selection would show whether the measured gains hold.
Original abstract
Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof -- a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two-mode retrieval system for this task. Its standard mode applies a hierarchy-informalized Mathlib corpus with an embedding-reranker pipeline, achieving state-of-the-art single-query retrieval without domain-specific fine-tuning (nDCG@10 of 0.62 vs. 0.53 for the next-best system). Its reasoning mode builds on standard mode as its retrieval substrate, targeting global premise retrieval through iterative sketch-retrieve-reflect cycles. On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%) on the same benchmark. In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation. We have open-sourced all code, data, and benchmarks. Code and data: https://github.com/frenzymath/LeanSearch-v2 . The standard mode is publicly available with API access at https://leansearch.net/ .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LeanSearch v2, a two-mode retrieval system for global premise retrieval in Lean 4 theorem proving. Standard mode uses an embedding-reranker pipeline on a hierarchy-informalized Mathlib corpus to achieve nDCG@10 of 0.62. Reasoning mode adds iterative sketch-retrieve-reflect cycles on top of standard mode, recovering 46.1% of ground-truth premise groups within top-10 candidates on a 69-query benchmark of research-level Mathlib theorems (outperforming 38.0% for strong reasoning systems and 9.3% for premise-selection baselines) and yielding 20% proof success in a fixed downstream prover loop (vs. 16% next-best and 4% without retrieval). All code, data, and benchmarks are open-sourced.
Significance. If the reported empirical gains hold under scrutiny, the work addresses a genuine gap between single-lemma semantic search and per-tactic premise selection by targeting the full scattered premise sets needed for concise proofs. The downstream propagation from retrieval quality to proof success rate is a concrete strength, and the public release of code, data, and the 69-query benchmark supports reproducibility and follow-on work in the Lean community.
major comments (2)
- [Experimental evaluation (benchmark and downstream sections)] The 69-query benchmark and fixed prover loop are load-bearing for the headline claims (46.1% group recovery and 20% proof success). The manuscript provides no description of query selection criteria, no variance estimates across runs, and no statistical significance tests comparing against the 38.0% and 16% baselines; without these, it is impossible to assess whether the measured gaps reflect typical Lean 4 usage or are artifacts of the chosen theorems and prover configuration.
- [Reasoning mode description and results] The reasoning mode's iterative sketch-retrieve-reflect procedure is presented as the key innovation for global retrieval, yet the paper does not report an ablation isolating the contribution of the reflect step versus simply running standard mode multiple times; this leaves open whether the 46.1% figure is driven by the new cycle or by repeated application of the already-strong standard-mode retriever.
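For concreteness, the group-recovery metric behind the 46.1% figure can be sketched as follows. This is our formulation of "recovers ground-truth premise groups within the top 10 candidates" (a query counts only if every ground-truth premise appears in the top-k); the paper's exact scoring may differ in detail.

```python
def group_recovered(ground_truth, retrieved, k=10):
    """True iff every ground-truth premise appears in the top-k retrieved."""
    return set(ground_truth) <= set(retrieved[:k])

def group_recovery_rate(queries, k=10):
    """queries: list of (ground_truth_premises, retrieved_ranking) pairs.
    Returns the fraction of queries whose full premise group is recovered."""
    hits = sum(group_recovered(gt, ranking, k) for gt, ranking in queries)
    return hits / len(queries)
```

This all-or-nothing criterion is what makes the ablation question above bite: repeated single-query retrieval could recover individual premises without ever covering a full group.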
minor comments (2)
- [Abstract and §3] The term 'hierarchy-informalized Mathlib corpus' appears in the abstract and method without an explicit definition or pointer to the preprocessing section; adding one sentence of clarification would aid readers unfamiliar with the corpus construction.
- [Results tables] Tables reporting nDCG@10, recall@10, and proof success should include the number of runs or standard deviations if any stochasticity is present in retrieval or proving.
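For reference, the headline retrieval metric can be sketched as below. This is the standard nDCG@k formulation with a log2 discount; the paper's exact gain and discount conventions are an assumption on our part.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k=10):
    """ranked_gains: relevance gains in retrieved order.
    Normalizes DCG@k by the DCG of the ideal (sorted) ordering."""
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_gains[:k]) / denom if denom > 0 else 0.0
```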
Simulated Author's Rebuttal
We thank the referee for the positive summary, the recognition of the work's significance, and the recommendation for minor revision. We address the two major comments point by point below, indicating the revisions we will make.
Point-by-point responses
- Referee: [Experimental evaluation (benchmark and downstream sections)] The 69-query benchmark and fixed prover loop are load-bearing for the headline claims (46.1% group recovery and 20% proof success). The manuscript provides no description of query selection criteria, no variance estimates across runs, and no statistical significance tests comparing against the 38.0% and 16% baselines; without these, it is impossible to assess whether the measured gaps reflect typical Lean 4 usage or are artifacts of the chosen theorems and prover configuration.
Authors: We agree that the benchmark details require expansion. In the revised manuscript we will add an explicit subsection describing the query selection criteria used to construct the 69-query set of research-level Mathlib theorems. The retrieval and reranking pipelines are deterministic given the fixed corpus, embedding model, and reranker, so variance across runs is exactly zero and does not need to be estimated. The downstream prover loop is likewise run under a fixed, deterministic configuration; we will add a clarifying sentence to this effect. We acknowledge the absence of statistical significance tests. With a benchmark of this size we will instead add a short discussion of the practical magnitude of the observed gains (e.g., lifting proof success from 4% to 20%) and note that formal hypothesis testing would have limited power; we do not plan to introduce new multi-seed experiments at this stage. revision: partial
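The limited-power point can be illustrated numerically. The sketch below computes Wilson 95% intervals for the two proof-success rates under an assumed per-condition sample size of n = 100 (the sample size is our assumption, not stated in this review); overlapping intervals suggest a 20% vs. 16% gap is hard to establish formally at this scale.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Assumed n = 100 trials per condition (illustrative, not from the paper).
lo20, hi20 = wilson_interval(20, 100)  # LeanSearch v2: 20% proof success
lo16, hi16 = wilson_interval(16, 100)  # next-best retriever: 16%
print(f"20/100: [{lo20:.3f}, {hi20:.3f}]")
print(f"16/100: [{lo16:.3f}, {hi16:.3f}]")
print("intervals overlap:", lo20 < hi16)
```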
- Referee: [Reasoning mode description and results] The reasoning mode's iterative sketch-retrieve-reflect procedure is presented as the key innovation for global retrieval, yet the paper does not report an ablation isolating the contribution of the reflect step versus simply running standard mode multiple times; this leaves open whether the 46.1% figure is driven by the new cycle or by repeated application of the already-strong standard-mode retriever.
Authors: We accept this criticism. The reflect step is intended to enable dynamic refinement of the search sketch using information from prior retrievals, which is distinct from simply repeating the standard-mode pipeline. To isolate its contribution we will add a new ablation in the revised results section that directly compares (i) the full sketch-retrieve-reflect cycle against (ii) an equal number of iterations of the standard mode alone (without the reflect component). The ablation will be reported alongside the existing 46.1% figure so that readers can judge the incremental value of the reflect mechanism. revision: yes
Circularity Check
No circularity: empirical benchmark results are direct measurements
full rationale
The paper reports retrieval metrics (nDCG@10, group recovery rates) and downstream proof success rates on a held-out 69-query Mathlib benchmark and a fixed prover loop. These are measured outcomes from running the described embedding-reranker pipeline and iterative sketch-retrieve-reflect cycles; they do not reduce by any equation or self-citation to quantities fitted inside the same run. No uniqueness theorem, ansatz, or self-definitional step is invoked to justify the central claims. The evaluation is self-contained against external baselines and open-sourced code/data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Mathlib declarations and their informalizations form a representative corpus for research-level Lean 4 theorems.