arxiv: 2605.07147 · v2 · submitted 2026-05-08 · 💻 cs.LO · cs.AI· cs.LG

Recognition: no theorem link

MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Zixuan Xie , Xinyu Liu , Shangtong Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:49 UTC · model grok-4.3

classification 💻 cs.LO cs.AIcs.LG

keywords Mathlibpull requestsLLM evaluationformal mathematicsLeanmerge readinessbenchmarkcode review

0 comments

The pith

LLM models and agents cannot reliably distinguish merge-ready Mathlib pull requests from build-passing ones that were never merged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds MathlibPR, a benchmark drawn directly from historical Mathlib4 pull-request records in the Lean formal library. It introduces a staged evaluation protocol that tests both standalone LLM models and LLM agents on the task of judging whether a PR follows library conventions and deserves to be merged. The central result is that current systems perform poorly: they cannot separate PRs that human reviewers accepted from those that merely passed the build check yet were later revised or rejected. This turns past human review decisions into a supervised signal that can train reviewer assistants and reward models for formal mathematics.

Core claim

By turning real Mathlib4 PR histories into labeled data for merge-readiness, the benchmark shows that both LLM models (such as DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (such as Codex and Claude Code) fail to tell apart truly merge-ready contributions from build-passing PRs that were revised or never merged.

What carries the argument

MathlibPR benchmark, constructed from historical PRs with merge decisions as supervised labels and evaluated through a staged protocol that measures discrimination between merge-ready and non-merge-ready cases.

If this is right

Historical PR data can supply training signals for LLM-based reviewer assistants.
Reward models trained on merge decisions could guide LLMs to generate contributions that integrate more easily into Mathlib.
The same approach offers a concrete way to measure and improve LLM performance on formal-library maintenance tasks.
Reducing reviewer workload could ease the current bottleneck in Mathlib's growth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be adapted to other large formal libraries such as those in Coq or Isabelle to test cross-system generalization.
If label noise from past reviews is high, the performance ceiling of any model trained on this data will remain limited.
Embedding such a scorer into a continuous-integration pipeline might allow automated pre-filtering of PRs before human review.

Load-bearing premise

Historical merge decisions by Mathlib reviewers provide an unbiased, high-quality label for what counts as a merge-ready contribution.

What would settle it

An LLM or agent that, when evaluated on held-out Mathlib PRs using the staged protocol, achieves substantially higher accuracy at separating merge-ready from build-passing unrevised cases than the models tested in the paper.

Figures

Figures reproduced from arXiv: 2605.07147 by Shangtong Zhang, Xinyu Liu, Zixuan Xie.

**Figure 1.** Figure 1: Overview of the MATHLIBPR dataset construction pipeline. Starting from closed PRs in the Mathlib4 repository, we retain PRs with Lean file changes. Historical build-check records identify build-passing commits on PR branches, including the first build-passing commit and, for unmerged PRs, the last build-passing commit before closure. These states provide not merge ready examples when they are later revised… view at source ↗

read the original abstract

The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency but do not directly contribute to it. In the meantime, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow the Mathlib's conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLM models (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLM models and LLM agents struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge-ready Mathlib contributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MathlibPR turns real PR histories into a benchmark for LLM review assistance and finds current models struggle, but the historical merge labels may be too noisy to make that conclusion stick.

read the letter

The main point is that this paper builds MathlibPR from actual Mathlib4 pull request data and tests whether LLMs or agents can help spot merge-ready contributions. The reported result is that neither the models like DeepSeek or Qwen nor the agents like Claude Code do a good job separating the ones that merged from the build-passing ones that got revised or stayed unmerged. They also lay out a staged evaluation protocol that tries to match real review steps. That setup is new and directly targets the review bottleneck everyone in the Lean community talks about. Turning the library's own history into a supervised signal is a practical move that could support future reward models or reviewer tools. The motivation about speeding up Mathlib growth and thereby helping verified AI systems is reasonable. The soft spot is the label quality. Historical merge decisions are treated as ground truth, but those decisions often depend on reviewer availability, queue length, author reputation, or timing rather than pure technical merit. If the positive class includes PRs that merged mainly because they arrived at a good time, then models that flag high-quality but unmerged work could look worse than they are. The abstract gives no details on dataset filtering, inter-reviewer agreement, or controls for those confounders, so the negative result is hard to interpret cleanly. This paper is aimed at people building LLM tools for formal libraries and at Mathlib maintainers looking for ways to scale reviews. A reader working on agentic formal reasoning would get value from the benchmark artifact itself and could re-label or extend it. I would send it to peer review. The benchmark is a concrete new resource worth having in the open, and referees can push on the evaluation design and label validation to make the claims tighter.

Referee Report

2 major / 1 minor

Summary. The paper introduces MathlibPR, a benchmark built from historical Mathlib4 PR data, along with a staged evaluation protocol. It evaluates several LLM models (DeepSeek, Qwen, Goedel, Kimina) and agents (Codex, Claude Code) on the task of distinguishing merge-ready PRs from build-passing PRs that were revised or never merged, reporting that both categories of systems struggle on this distinction. The work positions the benchmark as a supervised signal toward LLM-assisted PR review and reward models for formal libraries.

Significance. If the negative result holds after addressing label validity, the benchmark would be a useful contribution to the formal mathematics and LLM-for-reasoning communities. It directly targets the review bottleneck in Mathlib growth and supplies real-world supervised data rather than synthetic tasks. The staged protocol and focus on merge-readiness (as opposed to pure theorem-proving success) are practical strengths that could inform future reviewer-assistant systems.

major comments (2)

[§3] §3 (implied dataset construction): The positive class is defined by historical merge decisions and the negative class by build-passing but revised or unmerged PRs. No evidence is supplied that these labels are independent of confounders such as reviewer queue length, author reputation, or timing; without inter-reviewer agreement statistics or controls for these factors, the reported model struggle cannot be unambiguously attributed to model limitations rather than label noise.
[Evaluation protocol] Evaluation protocol and results: The abstract states a clear negative finding, yet the manuscript provides no dataset size, class balance, exact metrics (e.g., accuracy, F1, or ranking metrics), baselines, or statistical significance tests. These omissions make it impossible to judge whether the observed struggle exceeds what would be expected from noisy labels or from trivial baselines such as always predicting the majority class.

minor comments (1)

[Abstract] The abstract would be strengthened by a single sentence stating the number of PRs, the train/test split, and the primary quantitative metric used to declare that models 'struggle'.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important issues around label validity and reporting completeness. We have revised the manuscript to add the requested details and a limitations discussion. Point-by-point responses follow.

read point-by-point responses

Referee: [§3] §3 (implied dataset construction): The positive class is defined by historical merge decisions and the negative class by build-passing but revised or unmerged PRs. No evidence is supplied that these labels are independent of confounders such as reviewer queue length, author reputation, or timing; without inter-reviewer agreement statistics or controls for these factors, the reported model struggle cannot be unambiguously attributed to model limitations rather than label noise.

Authors: We agree that historical merge decisions can be influenced by confounders including reviewer workload, author reputation, and timing. These factors are intrinsic to the real-world review process that MathlibPR seeks to model. In the revised manuscript we have added an explicit limitations subsection in §3 that discusses these confounders and frames the labels as a practical proxy for merge-readiness rather than a noise-free gold standard. We have also clarified that the benchmark is intended to surface the difficulty of the task under realistic conditions. Inter-reviewer agreement statistics are unfortunately not recorded in the public Mathlib PR history, so they cannot be supplied. revision: partial
Referee: [Evaluation protocol] Evaluation protocol and results: The abstract states a clear negative finding, yet the manuscript provides no dataset size, class balance, exact metrics (e.g., accuracy, F1, or ranking metrics), baselines, or statistical significance tests. These omissions make it impossible to judge whether the observed struggle exceeds what would be expected from noisy labels or from trivial baselines such as always predicting the majority class.

Authors: We apologize for the incomplete reporting in the original submission. The revised manuscript now includes the full dataset size, class balance, all metrics (accuracy, F1, AUC), explicit comparison against the majority-class baseline, and statistical significance tests. These additions show that model performance remains close to the trivial baseline, supporting the negative finding even after accounting for possible label noise. revision: yes

standing simulated objections not resolved

Inter-reviewer agreement statistics, which are not recorded in the historical Mathlib PR data.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent evaluation

full rationale

The paper constructs MathlibPR from historical Mathlib PR data and reports an empirical observation that LLMs struggle to classify merge-ready versus build-passing PRs. No equations, derivations, fitted parameters, or predictions are present that reduce by construction to the paper's own inputs. The evaluation protocol is a standard supervised benchmark test against external historical labels; the result is an observation on model behavior rather than a tautological renaming or self-referential definition. No self-citation chains or ansatzes are load-bearing for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; the central claim rests on no mathematical axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5536 in / 1012 out tokens · 31377 ms · 2026-05-14T21:49:53.641632+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

Use the rubric axes as the checklist

work page
[2]

Use the naming/style/doc digests as authoritative written guidance

work page
[3]

Use changed files and first-order imports to judge local conventions and code quality

work page
[4]

Use unknown instead of guessing when evidence is insufficient

work page
[5]

Output only the required JSON object. </instructions> Stage 2 user prompt.The Stage 2 user prompt extends the Stage 1 prompt by inserting a <diagnostics> block inside <context>, and replacing the third item in <instructions> with one that mentions diagnostics. The new block is: <diagnostics> {linter_diagnostics} {import_diagnostics} {location_diagnostics}...

work page
[6]

All other content is identical to Stage 1

Treat diagnostics as evidence, not as automatically decisive. All other content is identical to Stage 1. Stage 3 user prompt.The Stage 3 user prompt extends the Stage 2 prompt by adding a <pr intent> block at the end of <context>, and replacing the third instruction with one that mentions PR intent. The new block is: <pr_intent> title: {pr_title} descript...

work page
[7]

All other content is identical to Stage 2

Treat the PR title/description only as stated intent; code and diagnostics ,→control the judgment. All other content is identical to Stage 2. B.4 Prompt Templates for LLM Agents The stage-specific user-message templates are shared between Codex and Claude Code. The system messages share a common reviewer contract; Claude Code additionally inlines backend-...

work page
[8]

Start from the provided diff and changed files

work page
[12]

Do not edit files or run build, test, lint, or CI-like commands

work page
[14]

</repo_workflow> Stage 2<repo workflow>block

Use unknown instead of guessing when repo-wide evidence remains insufficient. </repo_workflow> Stage 2<repo workflow>block. <repo_workflow>

work page
[15]

Start from the provided diff, changed files, and diagnostics

work page
[22]

</repo_workflow> 21 Stage 3<repo workflow>block

Use unknown instead of guessing when repo-wide evidence remains insufficient. </repo_workflow> 21 Stage 3<repo workflow>block. <repo_workflow>

work page
[23]

Start from the provided diff, changed files, diagnostics, and PR intent

work page
[24]

Use PR title/description only to interpret intended scope and expected ,→API/documentation surface

work page
[25]

Inspect touched declarations and nearby declarations in the same files

work page
[26]

Search the repository snapshot for: - possible duplicate or more general existing results - naming analogues in the same namespace or nearby file families - plausible file/module homes for the changed declarations

work page
[27]

Treat diagnostics as evidence, but verify them against the code and repository ,→context

work page
[28]

Use read-only repository tools only

work page
[29]

Do not edit files or run build, test, lint, or CI-like commands beyond the ,→supplied diagnostics

work page
[30]

Do not use git history, workflow metadata, internet access, GitHub/mathlib4 PR ,→pages, or any remote API

work page
[31]

<tmp_locked_tools_dir>

Use unknown instead of guessing when repo-wide evidence remains insufficient. </repo_workflow> C Runtime, Tool Restrictions, and Validity Handling C.1 Fresh-Session Protocol Each evaluation example is processed in a fresh per-sample session. No state, history, or memory is carried across examples. This ensures that each judgment is grounded only in the su...

work page