pith. machine review for the scientific record.

arxiv: 2605.10141 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Burak S. Akbudak, Can S. Erer, Gözde Gül Şahin, Zeynel A. Uluşan

Pith reviewed 2026-05-12 04:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords formal theorem proving · reward models · Lean 4 · preference pairs · benchmark · error injection · large language models · proof evaluation

The pith

A new benchmark shows frontier LLMs judge formal proofs more accurately than specialized theorem provers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FormalRewardBench to compare reward models for formal theorem proving without running full reinforcement learning experiments each time. It builds 250 preference pairs by pairing correct Lean 4 proofs with flawed versions created through five expert-designed error injection methods. Tests on these pairs find that the strongest large language models correctly identify the better proof 59.8 percent of the time, while models specialized in theorem proving reach only 24.4 percent. The results indicate that skill at generating proofs does not automatically produce skill at evaluating them. The benchmark is released publicly to help develop learned reward models that can supply denser signals than binary correctness checks alone.

Core claim

FormalRewardBench is the first benchmark for reward models in formal theorem proving with Lean 4. It consists of 250 preference pairs formed by pairing correct proofs with incorrect variants generated through five expert-curated error injection strategies. Evaluation across frontier LLMs, judge LLMs, general-purpose LLMs, and specialized theorem provers shows frontier LLMs achieve the highest performance at 59.8 percent while specialized theorem provers perform the worst at 24.4 percent, indicating that theorem proving ability does not transfer to proof evaluation.
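To make the 59.8 and 24.4 percent figures concrete: on a preference benchmark, a reward model is credited whenever it ranks the verified proof above its flawed twin. A minimal sketch of that metric, assuming a generic `score(theorem, proof) -> float` interface (an assumption; the paper's harness may differ, e.g., for judge models that emit a discrete choice rather than a scalar):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    theorem: str    # Lean 4 theorem statement
    chosen: str     # verified-correct proof
    rejected: str   # variant with an injected error
    strategy: str   # which of the five injection strategies produced it

def pairwise_accuracy(pairs: list[PreferencePair], score) -> float:
    """Fraction of pairs where the reward model ranks the correct proof higher.

    `score(theorem, proof) -> float` is any scalar reward model; ties count
    as failures, the strict reading of a preference judgment.
    """
    hits = sum(score(p.theorem, p.chosen) > score(p.theorem, p.rejected)
               for p in pairs)
    return hits / len(pairs)
```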

What carries the argument

The collection of 250 preference pairs created by applying five error injection strategies (forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection) to correct Lean 4 proofs.
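A sketch of how one such pair might be assembled, under stated assumptions: the strategy instructions below are paraphrases, and `generate` and `lean_check` are hypothetical interfaces standing in for the paper's generator model and the Lean 4 verifier (Figure 2 shows the actual pipeline):

```python
# Paraphrased strategy instructions; the paper's exact prompts differ.
INJECTION_STRATEGIES = {
    "forced_mistake": "Introduce a typical automated proof-generation error.",
    "minimal_variation": "Change a single point so the proof no longer checks.",
    "verbose_incorrect": "Write a long, plausible-looking but invalid proof.",
    "nl_justification": "Replace formal steps with natural-language prose.",
    "python_injection": "Insert Python code where Lean tactics are expected.",
}

def make_pair(theorem, correct_proof, strategy, generate, lean_check):
    """Build one preference pair, or None if the injection did not take.

    `generate(instruction, theorem, proof) -> str` is any LLM call;
    `lean_check(theorem, proof) -> bool` wraps the Lean 4 checker.
    """
    flawed = generate(INJECTION_STRATEGIES[strategy], theorem, correct_proof)
    # Keep the variant only if the original verifies and the variant fails,
    # so every pair carries an unambiguous ground-truth preference.
    if lean_check(theorem, correct_proof) and not lean_check(theorem, flawed):
        return {"theorem": theorem, "chosen": correct_proof,
                "rejected": flawed, "strategy": strategy}
    return None
```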

If this is right

  • Reward models for formal proofs can be compared directly on preference pairs without running full RL training loops.
  • Frontier LLMs can act as practical reward models to guide training of better theorem provers.
  • Specialized theorem proving models need separate development to become reliable proof evaluators.
  • Different error injection strategies pose varying levels of difficulty, with most remaining challenging for current models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLM-based reward models could give training signals on partial progress that binary verification cannot provide.
  • The observed gap between generation and evaluation skills suggests hybrid models that combine both abilities may be worth testing.
  • Extending the benchmark with errors actually produced by model search traces would test how well the current pairs represent real failure modes.
  • If reward models trained on this data improve prover performance, the benchmark could become a standard test for future reward model work.

Load-bearing premise

The five expert-curated error injection strategies produce incorrect proofs that match the kinds of mistakes actual models make during proof attempts.
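One way to probe this premise empirically: harvest failed proof attempts from the evaluated provers and tally them against the five-category taxonomy. A rough sketch, where `classify_error` is a hypothetical labeler (an LLM or rule-based categorizer, not anything the paper ships):

```python
from collections import Counter

def error_distribution(failed_attempts, classify_error):
    """Histogram of real model failures over the benchmark's error categories.

    `failed_attempts` is a list of (theorem, proof) pairs that did not verify;
    `classify_error(theorem, proof) -> str` maps each to a strategy label such
    as 'minimal_variation'. A large mass falling outside the five categories,
    or a heavily skewed split, would undercut the representativeness premise.
    """
    return Counter(classify_error(thm, proof) for thm, proof in failed_attempts)
```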

What would settle it

Training a theorem prover with a reward model that scores highest on the benchmark and measuring whether it solves more new theorems than one trained only with binary verification signals.
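Schematically, that experiment swaps one function in the RL loop. The contrast below is a sketch, with both callables as assumed interfaces rather than released code: RLVR yields zero signal on any theorem the policy cannot yet finish, while a learned scorer can still order partial attempts.

```python
def binary_reward(theorem, proof, lean_check) -> float:
    # RLVR signal: 1.0 only for a fully verified proof, 0.0 otherwise.
    # On hard theorems where no rollout verifies, every attempt scores 0.0
    # and the policy receives no gradient information.
    return 1.0 if lean_check(theorem, proof) else 0.0

def learned_reward(theorem, proof, reward_model) -> float:
    # A learned scorer can credit partial progress (e.g., a proof that
    # closes most goals), densifying the signal on hard theorems.
    return reward_model.score(theorem, proof)
```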

Figures

Figures reproduced from arXiv: 2605.10141 by Burak S. Akbudak, Can S. Erer, Gözde Gül Şahin, Zeynel A. Uluşan.

Figure 1. Overview of the error injection pipeline employed in FormalRewardBench. [image omitted; view at source ↗]
Figure 2. FormalRewardBench generation pipeline. Top: data flows from MiniF2F through … [image omitted; view at source ↗]
Figure 3. Model accuracy as a function of benchmark size. Accuracy stabilizes after 250 … [image omitted; view at source ↗]
Original abstract

Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FormalRewardBench, the first benchmark for reward models in formal theorem proving with Lean 4. It consists of 250 preference pairs pairing correct proofs with incorrect variants generated via five expert-curated error injection strategies (forced mistakes, minimal single-point variations, verbose incorrect proofs, natural-language justification, and Python code injection). The authors evaluate frontier LLMs (e.g., Claude Opus 4.5 at 59.8%), judge LLMs, general-purpose LLMs, and specialized theorem provers (e.g., DeepSeek-Prover-V2-7B at 24.4%), concluding that frontier LLMs perform best while specialized provers perform worst, suggesting that theorem-proving ability does not transfer to proof evaluation. They release the benchmark publicly and provide insights on the relative difficulty of the injection mechanisms.

Significance. If the benchmark construction is validated, this work provides a scalable, low-cost alternative to full RL training ablations for comparing reward models in neural theorem proving. The empirical finding that general frontier LLMs substantially outperform specialized provers on preference judgments offers a concrete signal for future reward model design. The public release of the 250-pair dataset is a clear strength that enables reproducible follow-up work.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: the five error injection strategies are presented as producing the incorrect proofs for the 250 pairs, but no comparison is reported between the distribution of these synthetic errors and the actual incorrect proofs generated by the evaluated models (frontier LLMs or specialized provers). This representativeness assumption is load-bearing for the central interpretation that the 59.8% vs. 24.4% gap demonstrates lack of transfer from proving to evaluation.
  2. [Results] Results section (and abstract): performance figures such as 59.8% and 24.4% are reported without any mention of how the 250 pairs were validated (e.g., expert review process), inter-annotator agreement, or statistical tests (e.g., McNemar or bootstrap) for the differences across model categories. With only 250 pairs, the absence of these details leaves the reliability of the headline comparison open to methodological questions.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the total number of unique theorems underlying the 250 pairs and whether multiple error types were applied per theorem.
  2. [Results] Table or figure presenting per-strategy accuracies would benefit from error bars or per-model breakdowns to clarify which injection mechanisms drive the overall gap.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to improve the manuscript.

Point-by-point responses
  1. Referee: Benchmark construction section: the five error injection strategies are presented as producing the incorrect proofs for the 250 pairs, but no comparison is reported between the distribution of these synthetic errors and the actual incorrect proofs generated by the evaluated models (frontier LLMs or specialized provers). This representativeness assumption is load-bearing for the central interpretation that the 59.8% vs. 24.4% gap demonstrates lack of transfer from proving to evaluation.

    Authors: We acknowledge that comparing the distribution of our expert-curated synthetic errors to the actual error patterns produced by the evaluated models would strengthen the transferability interpretation. The benchmark is intentionally constructed around diverse, expert-defined error types (forced mistakes, minimal variations, verbose proofs, natural-language justifications, and code injection) to create challenging preference pairs that test reward model robustness beyond binary verification. The observed performance gap still provides evidence that specialized provers underperform on proof evaluation tasks relative to frontier LLMs. In the revision, we will add a dedicated limitations paragraph clarifying the scope of the benchmark, noting the absence of a direct distributional comparison, and recommending future work to collect model-generated incorrect proofs for such analysis. revision: partial

  2. Referee: Results section (and abstract): performance figures such as 59.8% and 24.4% are reported without any mention of how the 250 pairs were validated (e.g., expert review process), inter-annotator agreement, or statistical tests (e.g., McNemar or bootstrap) for the differences across model categories. With only 250 pairs, the absence of these details leaves the reliability of the headline comparison open to methodological questions.

    Authors: We agree that explicit details on validation and statistical rigor are necessary for a benchmark paper. All 250 pairs were constructed and manually validated by Lean 4 experts to confirm that the injected variants are indeed invalid while preserving the original theorem statement. We will expand the benchmark construction section to describe this expert review process in detail. Additionally, we will report inter-annotator agreement metrics (where multiple reviewers assessed subsets) and include statistical tests such as bootstrap confidence intervals and McNemar's test for pairwise model comparisons in the results section and abstract to quantify the reliability of the reported differences. revision: yes
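For reference, both promised tests reduce to a few lines over per-pair 0/1 outcome vectors. A minimal sketch, assuming `scipy` is available; names and defaults are illustrative, not the authors' analysis code:

```python
import random
from scipy.stats import binomtest

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over 0/1 per-pair outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    accs = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    return accs[int(alpha / 2 * n_boot)], accs[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_exact(outcomes_a, outcomes_b):
    """Exact McNemar test for two models scored on the same 250 pairs.

    Only discordant pairs are informative; under the null of equal accuracy,
    each discordance is a fair coin flip.
    """
    b = sum(1 for x, y in zip(outcomes_a, outcomes_b) if x and not y)
    c = sum(1 for x, y in zip(outcomes_a, outcomes_b) if y and not x)
    if b + c == 0:
        return 1.0  # the models agree on every pair
    return binomtest(b, b + c, 0.5).pvalue
```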

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark construction and evaluation

Full rationale

The paper constructs FormalRewardBench from 250 preference pairs using five expert-curated error injection strategies and reports direct empirical accuracies for different model classes on those pairs. No equations, derivations, fitted parameters, or predictions appear; the headline result (59.8% vs 24.4%) is a straightforward measurement on the fixed dataset rather than a quantity forced by the paper's own definitions or self-citations. The work contains no self-definitional steps, fitted-input predictions, uniqueness theorems, or ansatzes smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution is the curated preference dataset, which rests on the domain assumption that the five error injection strategies produce representative flawed proofs suitable for testing reward models.

axioms (1)
  • domain assumption Expert-curated error injection strategies generate incorrect proofs that meaningfully test reward model capabilities.
    The benchmark's validity depends on these strategies covering realistic failure modes without further validation described in the abstract.

pith-pipeline@v0.9.0 · 5615 in / 1226 out tokens · 49142 ms · 2026-05-12T04:17:38.375557+00:00 · methodology

