Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Haitham Bou Ammar; Matthieu Zimmer; Pauline Bourigault; Rasul Tutunov; Xiaotong Ji

arxiv: 2605.28365 · v1 · pith:WALU5PW2new · submitted 2026-05-27 · 💻 cs.AI · cs.CL· cs.LO

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Pauline Bourigault , Xiaotong Ji , Matthieu Zimmer , Rasul Tutunov , Haitham Bou Ammar This is my paper

Pith reviewed 2026-06-29 12:17 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LO

keywords risk controllean theorem proverautoformalizationmathematical reasoningselective predictionfinite-sample boundsnatural language to formal

0 comments

The pith

COVCAL turns partial Lean proof traces into a selector that accepts 48% of problems at 0.98 accuracy under finite-sample risk bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Lean verification of natural-language math answers yields a signal whose reliability depends sharply on formalization coverage and is often unfaithful. It introduces COVCAL, a selector over Lean-trace diagnostics that either accepts answers while certifying a finite-sample upper bound on error rate or abstains entirely. With a prover-specialized formalizer reaching 79% coverage the method accepts roughly 48% of MATH-500 problems at 0.98 accuracy, whereas a 7B formalizer produces too few proofs and forces abstention on every partition. Self-consistency alone already reaches 91% accuracy, so the work focuses on the precise conditions under which the partial formal signal becomes trustworthy under risk control. The proof signal itself is coverage-dependent: proved answers are correct 96% of the time at high coverage but only 20% at low coverage.

Core claim

COVCAL is a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under Bonferroni or dev-then-cal rules. Feasibility depends on autoformalization coverage: a prover-specialized formalizer at 79% coverage makes the procedure feasible on 17 of 20 bootstrap partitions and accepts approximately 48% of problems at 0.98 accepted accuracy, while a 7B formalizer leaves the signal too sparse for any acceptance. The contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.

What carries the argument

COVCAL selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound or abstains.

If this is right

High autoformalization coverage is required before any answers can be accepted under the certified bound.
Accepted accuracy reaches 0.98 while self-consistency alone reaches 0.91.
The Bonferroni rule is more conservative and abstains more often than the dev-then-cal rule.
Only about 43% of generated proofs are faithful to the original answer in manual audit.
The proved-answer correctness rate drops from 96% at high coverage to 20% at low coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selector structure could be applied to other formal systems if comparable trace diagnostics exist.
Combining the Lean signal with self-consistency might raise acceptance rate while preserving the risk certificate.
The coverage threshold needed for feasibility may differ on problem distributions unlike MATH-500.
If statistical dependence across problems is stronger than assumed, the finite-sample bounds would require correction.

Load-bearing premise

The chosen Lean-trace diagnostics are sufficiently informative and the Bonferroni or dev-then-cal rule produces a valid finite-sample bound without unaccounted dependence or model misspecification on the MATH-500 distribution.

What would settle it

A fresh collection of math problems on which the accuracy of answers accepted by COVCAL falls below the certified bound would show the diagnostics or calibration rule do not deliver the claimed guarantee.

Figures

Figures reproduced from arXiv: 2605.28365 by Haitham Bou Ammar, Matthieu Zimmer, Pauline Bourigault, Rasul Tutunov, Xiaotong Ji.

**Figure 1.** Figure 1: COVCAL treats Lean as a partial-observation judge: trust formal evidence when coverage is sufficient, and abstain when coverage is too low. The method couples Lean-based formal traces with finite-sample selective prediction. this search. The dev-then-cal regime chooses the threshold on an independent development split and certifies that single threshold once on the calibration split. Both regimes certify … view at source ↗

**Figure 2.** Figure 2: Accepted fraction versus held-out selective [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

read the original abstract

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is the proof-winning answer is correct 96% of the time at high proved coverage but 20% at low, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

COVCAL shows partial Lean signals can support risk-controlled selection on MATH-500 once coverage hits ~79%, but the finite-sample bound may not survive the observed sparsity and low faithfulness.

read the letter

This paper shows that you can get a certified selective-risk bound from Lean traces on natural-language math answers, but only when the autoformalizer reaches high coverage. With the prover-specialized formalizer at 79% coverage, COVCAL accepts about 48% of MATH-500 problems at 0.98 accepted accuracy under Bonferroni or dev-then-cal; the 7B formalizer is too sparse and the bound never triggers. Self-consistency alone hits 91% accuracy without any formal guarantee, so the contribution is the precise account of when the partial formal signal becomes usable under risk control.

What is new is the coverage-dependence analysis (proof success predicts correctness at 96% high vs 20% low), the manual faithfulness audit (only 43% of proved statements hold up), and the COVCAL selector that abstains when the bound cannot be met. They compare two formalizers on 20 bootstrap partitions and show feasibility flips from 0/20 to 17/20. The statistical tools are standard, but applying them to this sparse, coverage-dependent diagnostic is a concrete step.

The work is useful for anyone who needs bounded error rather than average accuracy in LLM math pipelines. The empirical numbers are clear and the comparison across formalizers is informative.

The soft spot is the validity of the bound itself. The abstract already flags that the signal is sparse and only 43% faithful; either fact can create dependence or bias that the Bonferroni or dev-then-cal rule does not automatically handle. If the paper does not derive the bound explicitly or test for misspecification on the MATH-500 distribution, the certification claim rests on an unverified assumption. No code or audit protocol is mentioned, which makes the 48% figure hard to reproduce.

This is for researchers working on safe LLM reasoning systems. It deserves a serious referee because the mechanism is practical and the experiments target a real deployment gap. I would send it to review but ask the referees to check the bound under the reported sparsity.

Referee Report

2 major / 1 minor

Summary. The paper claims that Lean-based judging of natural-language mathematical answers on MATH-500 yields a coverage-dependent and sparse/unfaithful signal (96% vs 20% correctness split, 28% proof rate, ~43% faithfulness). It introduces COVCAL, a selector over Lean-trace diagnostics that applies Bonferroni or dev-then-cal rules to certify finite-sample selective-risk bounds, enabling acceptance of ~48% of problems at 0.98 accepted accuracy (17/20 partitions feasible) when a prover-specialized formalizer reaches 79% coverage; this is positioned against self-consistency reaching only 91% accuracy without risk control.

Significance. If the selective-risk bounds remain valid after accounting for dependence, the work supplies a concrete mechanism for certifying reliability when leveraging partial formal verification signals in math reasoning, highlighting the decisive role of autoformalization coverage. The explicit comparison to self-consistency and the partition-wise feasibility results provide a useful empirical baseline for when formal signals can be trusted under risk control.

major comments (2)

[Abstract] Abstract: the central claim that COVCAL 'certifies a finite-sample selective-risk bound' at 0.98 accepted accuracy rests on the Bonferroni and dev-then-cal constructions, yet the text supplies no derivation of these bounds, no accounting for possible dependence across Lean traces or autoformalizer-induced correlations, and no discussion of misspecification relative to the MATH-500 distribution; this directly undermines the validity of the reported risk control.
[Abstract] Abstract: the reported figures (96%/20% split, 28% proof rate, 43% faithfulness, 48% acceptance, 17/20 partitions) are presented without error bars, audit protocol details, or code, leaving the empirical support for feasibility and the comparison to self-consistency unverified and therefore load-bearing for the paper's conclusions.

minor comments (1)

[Abstract] Abstract: the precise definition of 'accepted accuracy' and how the 0.98 threshold is computed from the diagnostics should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the risk-control claims and empirical transparency. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additions.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that COVCAL 'certifies a finite-sample selective-risk bound' at 0.98 accepted accuracy rests on the Bonferroni and dev-then-cal constructions, yet the text supplies no derivation of these bounds, no accounting for possible dependence across Lean traces or autoformalizer-induced correlations, and no discussion of misspecification relative to the MATH-500 distribution; this directly undermines the validity of the reported risk control.

Authors: The Bonferroni and dev-then-cal bounds are derived in Sections 3.2 and 3.3 using standard finite-sample selective inference arguments (adapting the methods of Bates et al. and Angelopoulos et al. to Lean diagnostic scores). We will add a concise derivation sketch and explicit references to these sections in the abstract and introduction. On dependence: the current analysis treats the 20 bootstrap partitions as exchangeable, but we acknowledge potential correlations induced by the autoformalizer; we will add a new paragraph quantifying empirical sensitivity to dependence and noting it as a limitation. On misspecification: the bounds are distribution-free and require only i.i.d. sampling between calibration and test sets, with no parametric assumptions on MATH-500; we will explicitly state this and discuss robustness to distribution shift. revision: yes
Referee: [Abstract] Abstract: the reported figures (96%/20% split, 28% proof rate, 43% faithfulness, 48% acceptance, 17/20 partitions) are presented without error bars, audit protocol details, or code, leaving the empirical support for feasibility and the comparison to self-consistency unverified and therefore load-bearing for the paper's conclusions.

Authors: We will add error bars (standard deviation across the 20 bootstrap partitions) to all figures. The faithfulness audit protocol, including annotation guidelines and inter-annotator agreement, is described in Appendix B; we will expand this section with additional details. The full code for COVCAL, the bootstrap procedure, Lean diagnostics, and self-consistency baseline will be released in a public repository upon acceptance. These changes will enable independent verification of the 48% acceptance rate at 0.98 accuracy and the 17/20 partition feasibility result. revision: yes

Circularity Check

0 steps flagged

No circularity; standard statistical bounds applied to diagnostics

full rationale

The paper applies known finite-sample selective-risk bounds (Bonferroni and dev-then-cal) to observed Lean-trace diagnostics. These are external statistical constructions whose validity does not reduce to any equation or parameter fitted inside the paper. No self-definitional loop, no fitted input renamed as prediction, and no load-bearing self-citation chain appears in the derivation. The result is an empirical application whose coverage and accuracy numbers are measured on MATH-500 rather than forced by construction. Minor self-citation, if present, is not load-bearing for the central risk-control claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Paper rests on standard selective inference bounds and empirical observations from MATH-500; no free parameters, new axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption The selective risk bound produced by Bonferroni or dev-then-cal on the chosen diagnostics is valid for the observed coverage regime.
Invoked when claiming certified risk on accepted answers.

pith-pipeline@v0.9.1-grok · 5799 in / 1313 out tokens · 29087 ms · 2026-06-29T12:17:37.762413+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Ran El-Yaniv and Yair Wiener. 2010. On the founda- tions of noise-free selective classification.Journal of Machine Learning Research, 11:1605–1641. Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. InAdvances in Neural Information Pro...

work page internal anchor Pith review Pith/arXiv arXiv 2010
[2]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. InProceed- ings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 9426–9439. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency impr...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

MiniF2F: A cross-system benchmark for for- mal Olympiad-level mathematics.arXiv preprint arXiv:2109.00110. Appendices A Proofs Proof of Theorem 1. Fix a threshold τ∈ T . Be- cause Aτ and gF are computed without calibra- tion labels, the accepted calibration examples are an i.i.d. sample from the conditional deployment distribution given Aτ(X) = 1, conditi...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Main dataset:all-levels MATH-500, 378 kept after filtering (500 raw minus 122 excluded for non-normalizable references, proof-only items, and diagram-dependent geometry)
[5]

Hard subset:MATH-500 levels 4–5 only, 199 kept after the same filters
[6]

Candidate generator:Qwen2.5-Math-7B- Instruct (bfloat16, vLLM 0.10.2), K= 32 sam- ples, temperature 0.7, top-p= 0.95 , maximum 2048 new tokens, final answer in boxed{...}

2048
[7]

Class aggregation:normalize final answers and computeQ c from self-consistency frequency
[8]

Formalization:Qwen2.5-Coder-7B-Instruct (bfloat16) autoformalization with four artifacts per top class and one error-feedback repair pass
[9]

Formalized classes:top four classes per prob- lem
[10]

Lean proving:Lean 4.21.0 with Mathlib at commit 308445d7; tactic portfolio norm_num, ring_nf, omega, linarith, nlinarith, simp, aesop, decide, and short combinations; 20 sec- onds per tactic script
[11]

Calibration:20/40/40 develop- ment/calibration/test split with seed 0; target selective risk ϵ= 0.15 on the all-levels run and ϵ= 0.10 on the hard subset; confidence 1−δ= 0.95 ; Clopper–Pearson bounds with Bonferroni correction over T (Appendix D; |T |= 125)
[12]

Baselines reported:self-consistency, confidence-only abstention, raw Lean+fallback, proof-existence abstention, typed-coverage only, proved-coverage only, margin-only, COVCAL, and COVCAL+fallback. D Threshold Grid Used in Experiments The grid used in all reported runs is Ttyp ={0,0.25,0.5,0.75,0.9}, Tprf ={0,0.1,0.25,0.5,0.75}, TM ={−0.5,0,0.1,0.25,0.5}, ...
[13]

Problem fields:problem id, dataset, topic la- bel, problem text, reference answer, normalized reference class, and inclusion/exclusion status
[14]

Candidate fields:the K raw sampled solu- tions, extracted final answers, normalized an- swer classes, and self-consistency weightsQ c
[15]

Formalization fields:for each top answer class, the generated Lean code, repair-round index, imports, and theorem statement
[16]

Lean fields:artifact-level status, class-level sta- tus, tactic script or proof attempt, runtime, time- out flag, error message, Lean version, Mathlib commit, and hardware
[17]

Per-status definitions.The best Lean status sj of an artifact takes one of five values

Selection fields:self-consistency class, highest- weight proved class, conflict flag, Ctyp, Cprf, Runres, M, COVCALaccept/reject decision, fallback decision when used, and correctness. Per-status definitions.The best Lean status sj of an artifact takes one of five values. proved means a Lean proof of the generated statement was found and kernel checked. t...
[18]

What is the 2003rd term of the sequence of odd numbers 1,3,5,7, . . . ?

Unless otherwise stated, examples are split af- ter generation and formalization into development, calibration, and test partitions. The seed-0 split is preregistered; the bootstrap summaries resample the frozen observations over K= 20 dev/cal/test partitions. Generation and normalization.Qwen2.5-Math- 7B-Instruct generates 32 solutions per problem at tem...

2048
[19]

the nth odd number is 2n−1

All K=32 samples collapse to the single answer class 4005 (w=1.0), which is proved. The kernel-checked statement is 2·2003−1 = 4005 (bynorm_num), a faithful closed form of “the nth odd number is 2n−1 .” Diagnostics: Ctyp=Cprf=1, margin 1. COVCALaccepts 4005 (correct).Lesson:a proof is trusted when the proved class dominates the for- mally visible mass wit...

2003
[20]

The trigonometric product identity does not autoformalize

The dominant class 1/16 (w=0.84) and the minor classes 1/32 and 1/8 all fail Lean elaboration. The trigonometric product identity does not autoformalize. So, Ctyp=Cprf=0 and no class is proved. COVCALabstains; the self- consistency fallback returns1/16 (correct).Lesson: a missing proof is unresolved, not negative; absten- tion plus fallback preserves the ...

[1] [1]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Ran El-Yaniv and Yair Wiener. 2010. On the founda- tions of noise-free selective classification.Journal of Machine Learning Research, 11:1605–1641. Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. InAdvances in Neural Information Pro...

work page internal anchor Pith review Pith/arXiv arXiv 2010

[2] [2]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. InProceed- ings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 9426–9439. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency impr...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

MiniF2F: A cross-system benchmark for for- mal Olympiad-level mathematics.arXiv preprint arXiv:2109.00110. Appendices A Proofs Proof of Theorem 1. Fix a threshold τ∈ T . Be- cause Aτ and gF are computed without calibra- tion labels, the accepted calibration examples are an i.i.d. sample from the conditional deployment distribution given Aτ(X) = 1, conditi...

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Main dataset:all-levels MATH-500, 378 kept after filtering (500 raw minus 122 excluded for non-normalizable references, proof-only items, and diagram-dependent geometry)

[5] [5]

Hard subset:MATH-500 levels 4–5 only, 199 kept after the same filters

[6] [6]

Candidate generator:Qwen2.5-Math-7B- Instruct (bfloat16, vLLM 0.10.2), K= 32 sam- ples, temperature 0.7, top-p= 0.95 , maximum 2048 new tokens, final answer in boxed{...}

2048

[7] [7]

Class aggregation:normalize final answers and computeQ c from self-consistency frequency

[8] [8]

Formalization:Qwen2.5-Coder-7B-Instruct (bfloat16) autoformalization with four artifacts per top class and one error-feedback repair pass

[9] [9]

Formalized classes:top four classes per prob- lem

[10] [10]

Lean proving:Lean 4.21.0 with Mathlib at commit 308445d7; tactic portfolio norm_num, ring_nf, omega, linarith, nlinarith, simp, aesop, decide, and short combinations; 20 sec- onds per tactic script

[11] [11]

Calibration:20/40/40 develop- ment/calibration/test split with seed 0; target selective risk ϵ= 0.15 on the all-levels run and ϵ= 0.10 on the hard subset; confidence 1−δ= 0.95 ; Clopper–Pearson bounds with Bonferroni correction over T (Appendix D; |T |= 125)

[12] [12]

Baselines reported:self-consistency, confidence-only abstention, raw Lean+fallback, proof-existence abstention, typed-coverage only, proved-coverage only, margin-only, COVCAL, and COVCAL+fallback. D Threshold Grid Used in Experiments The grid used in all reported runs is Ttyp ={0,0.25,0.5,0.75,0.9}, Tprf ={0,0.1,0.25,0.5,0.75}, TM ={−0.5,0,0.1,0.25,0.5}, ...

[13] [13]

Problem fields:problem id, dataset, topic la- bel, problem text, reference answer, normalized reference class, and inclusion/exclusion status

[14] [14]

Candidate fields:the K raw sampled solu- tions, extracted final answers, normalized an- swer classes, and self-consistency weightsQ c

[15] [15]

Formalization fields:for each top answer class, the generated Lean code, repair-round index, imports, and theorem statement

[16] [16]

Lean fields:artifact-level status, class-level sta- tus, tactic script or proof attempt, runtime, time- out flag, error message, Lean version, Mathlib commit, and hardware

[17] [17]

Per-status definitions.The best Lean status sj of an artifact takes one of five values

Selection fields:self-consistency class, highest- weight proved class, conflict flag, Ctyp, Cprf, Runres, M, COVCALaccept/reject decision, fallback decision when used, and correctness. Per-status definitions.The best Lean status sj of an artifact takes one of five values. proved means a Lean proof of the generated statement was found and kernel checked. t...

[18] [18]

What is the 2003rd term of the sequence of odd numbers 1,3,5,7, . . . ?

Unless otherwise stated, examples are split af- ter generation and formalization into development, calibration, and test partitions. The seed-0 split is preregistered; the bootstrap summaries resample the frozen observations over K= 20 dev/cal/test partitions. Generation and normalization.Qwen2.5-Math- 7B-Instruct generates 32 solutions per problem at tem...

2048

[19] [19]

the nth odd number is 2n−1

All K=32 samples collapse to the single answer class 4005 (w=1.0), which is proved. The kernel-checked statement is 2·2003−1 = 4005 (bynorm_num), a faithful closed form of “the nth odd number is 2n−1 .” Diagnostics: Ctyp=Cprf=1, margin 1. COVCALaccepts 4005 (correct).Lesson:a proof is trusted when the proved class dominates the for- mally visible mass wit...

2003

[20] [20]

The trigonometric product identity does not autoformalize

The dominant class 1/16 (w=0.84) and the minor classes 1/32 and 1/8 all fail Lean elaboration. The trigonometric product identity does not autoformalize. So, Ctyp=Cprf=0 and no class is proved. COVCALabstains; the self- consistency fallback returns1/16 (correct).Lesson: a missing proof is unresolved, not negative; absten- tion plus fallback preserves the ...