Recognition: 2 theorem links · Lean Theorem
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
Pith reviewed 2026-05-12 04:43 UTC · model grok-4.3
The pith
LLM-generated proofs differ substantially in quality beyond mere correctness, with measurable trade-offs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that correctness alone fails to capture important differences in LLM proof quality, as shown by substantial model-to-model variation on the ProofRank benchmark's five proxies (conciseness, computational ease, cognitive simplicity, diversity, and adaptivity) and by systematic trade-offs between these proxies and correctness.
What carries the argument
The ProofRank benchmark and its five scalable proxies, each of which scores an aspect of proof quality independently of correctness.
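A minimal sketch, not from the paper, of what scoring quality separately from correctness could look like in data terms: each generated proof carries a correctness flag plus one score per proxy, and model-level profiles are averaged without folding correctness into the quality numbers. The record fields, scales, and plain averaging are illustrative assumptions.

```python
# Illustrative sketch only: field names, scales, and the plain averaging are
# assumptions, not ProofRank's actual data model.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

@dataclass
class ProofRecord:
    model: str
    problem_id: str
    correct: bool               # correctness judged separately from quality
    conciseness: float          # each proxy on its own scale, e.g. 0..1
    computational_ease: float
    cognitive_simplicity: float
    diversity: float
    adaptivity: float

PROXIES = ("conciseness", "computational_ease", "cognitive_simplicity",
           "diversity", "adaptivity")

def model_profile(records: List[ProofRecord], model: str) -> Dict[str, float]:
    """Average each proxy for one model; correctness is reported alongside,
    never mixed into the quality scores."""
    own = [r for r in records if r.model == model]
    profile = {p: mean(getattr(r, p) for r in own) for p in PROXIES}
    profile["correct_rate"] = mean(1.0 if r.correct else 0.0 for r in own)
    return profile
```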
If this is right
- Correctness-only benchmarks miss substantial differences in proof quality across models.
- Significant trade-offs exist between the quality proxies and correctness rates.
- Future evaluations of mathematical reasoning in LLMs should measure how useful the generated proofs are.
Where Pith is reading between the lines
- These proxies could be turned into training signals to push models toward more practical proofs.
- The same approach might apply to evaluating LLM outputs in formal verification or code generation.
- Human preference studies could test whether the automatic scores align with what mathematicians actually value.
Load-bearing premise
The five chosen proxies accurately reflect aspects of proof quality that matter to human readers and can be measured objectively at scale.
What would settle it
A large-scale study in which expert mathematicians consistently prefer proofs that the proxies score as low-quality over those scored high-quality would falsify the central claim.
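One concrete way such a study could be scored, sketched under assumed data: compute the rank agreement between a proxy's scores and expert preferences over the same proofs, e.g. Kendall's tau. The proof scores and expert ratings below are hypothetical; the paper reports no such study.

```python
# Hypothetical data and protocol: rank agreement between one proxy's scores
# and expert preferences over the same proofs.
from itertools import combinations
from typing import Sequence

def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """+1 if the proxy orders proofs exactly as the experts do, -1 if it
    inverts their preferences (tied pairs are ignored)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        concordant += s > 0
        discordant += s < 0
    return (concordant - discordant) / (concordant + discordant)

proxy_scores   = [0.9, 0.4, 0.7, 0.2, 0.6]   # hypothetical proxy scores
expert_ratings = [4.5, 2.0, 4.0, 1.5, 3.0]   # hypothetical expert ratings
print(kendall_tau(proxy_scores, expert_ratings))  # 1.0: perfect agreement
```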
Original abstract
Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.
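The abstract defines the proxies only conceptually. As one hedged illustration, the diversity proxy (iv) could be operationalized as one minus the mean pairwise textual similarity of a model's sampled proofs for a single problem; the sketch below uses difflib.SequenceMatcher as a stand-in similarity measure, an assumption rather than the paper's method.

```python
# Stand-in similarity measure; ProofRank does not specify how diversity is scored.
from difflib import SequenceMatcher
from itertools import combinations
from typing import List

def diversity_score(proofs: List[str]) -> float:
    """Higher when a model's proofs for the same problem differ more from
    each other; 0 when fewer than two samples are available."""
    if len(proofs) < 2:
        return 0.0
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(proofs, 2)]
    return 1.0 - sum(sims) / len(sims)

samples = [
    "Induct on n; the base case is immediate and the step uses the hypothesis.",
    "Induct on n, checking n = 1 directly and passing from n to n + 1.",
    "Count both sides of the identity combinatorially and match the counts.",
]
print(round(diversity_score(samples), 3))
```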
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProofRank, a benchmark of challenging competition problems, to evaluate LLM-generated proofs using five automatic proxies for quality beyond binary correctness: conciseness (step count), computational ease (calculation intensity), cognitive simplicity (technique accessibility), diversity (output variation across samples), and adaptivity (adherence to a specified technique). It claims that models exhibit substantial differences in these dimensions not captured by correctness-only metrics and that significant trade-offs exist between proof-quality proxies and correctness rates.
Significance. If the proxies can be shown to be reproducible and to correlate with human judgments of proof usefulness, the work would meaningfully advance LLM evaluation in mathematical reasoning by shifting focus from correctness alone to practical utility, clarity, and transferability. The empirical benchmark approach itself is a constructive contribution.
major comments (3)
- [Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.
- [Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.
- [Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.
minor comments (2)
- [Methods] Clarify whether the proxies are fully automatic or involve any LLM-as-judge components, and provide pseudocode or formulas for each metric.
- [Results tables] Tables comparing models should report effect sizes or p-values alongside raw differences to support the “substantial differences” claim.
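To make the second minor comment concrete, each model-comparison entry could be backed by an effect size and a p-value along the following lines; the per-problem proxy scores, and the choice of Cohen's d with Welch's t-test, are illustrative assumptions rather than the paper's analysis.

```python
# Hypothetical per-problem scores for two models on one proxy; Cohen's d and
# Welch's t-test are one reasonable choice, assuming SciPy is available.
from statistics import mean, stdev
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

model_a = [0.71, 0.65, 0.80, 0.74, 0.69, 0.77]   # e.g. conciseness per problem
model_b = [0.58, 0.62, 0.66, 0.55, 0.60, 0.64]

t_stat, p_value = ttest_ind(model_a, model_b, equal_var=False)  # Welch's test
print(f"d = {cohens_d(model_a, model_b):.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```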
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. The comments highlight important aspects for improving the reproducibility and validity of our benchmark. We respond to each major comment point by point below.
Point-by-point responses
- Referee: [Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.
Authors: We agree that the proxy definitions require more concrete details to support reproducibility. The manuscript describes the proxies conceptually, but we will revise the relevant sections to include specific heuristics (e.g., for conciseness: automated step counting via proof tree parsing with a threshold of fewer than 10 steps; for cognitive simplicity: a scoring system based on technique frequency in competition solutions), thresholds, the models used for any LLM-assisted scoring, and a link to the open-source code implementing these proxies. This will allow readers to assess and reproduce the measurements. revision: yes
- Referee: [Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.
Authors: We recognize the value of validating the proxies against human judgments. Our focus was on developing scalable automatic proxies that can be applied broadly without human intervention. We did not include a full correlation study in this work, which is a limitation. In the revision, we will add statistical tests (e.g., t-tests for model differences) and confidence intervals to the results. We will also include a discussion of proxy validation and commit to conducting a human evaluation study in follow-up work. We believe the current proxies are grounded in established mathematical values but acknowledge the need for empirical validation. revision: partial
- Referee: [Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.
Authors: We agree that additional details on the benchmark construction are essential. The revised manuscript will expand the Benchmark section to describe: the criteria for selecting problems from competitions (focusing on those with known multiple proof approaches and difficulty level), the exact prompting templates used for generating proofs, the sampling strategy (e.g., number of samples per problem, temperature settings for diversity measurement), and the operationalization of adaptivity (including the technique specification in prompts and the method for checking adherence, such as through rule-based matching or secondary LLM classification). These additions will facilitate reproducibility and bias assessment. revision: yes
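As a sketch of the rule-based adherence check mentioned in the response above, the snippet below decides whether a generated proof follows a requested technique by matching technique-specific cue phrases; the cue lists and the hit threshold are hypothetical stand-ins for whatever rules or secondary LLM classifier the authors actually use.

```python
# Hypothetical cue lists and threshold; a stand-in for the authors' rules or
# a secondary LLM classifier.
import re

TECHNIQUE_CUES = {
    "induction": [r"\binduct(ion|ive)?\b", r"\bbase case\b", r"\binductive step\b"],
    "contradiction": [r"\bcontradiction\b", r"\bsuppose (for contradiction|not)\b"],
    "pigeonhole": [r"\bpigeonhole\b"],
}

def follows_technique(proof: str, technique: str, min_hits: int = 1) -> bool:
    """Count how many cue patterns for the requested technique appear."""
    hits = sum(bool(re.search(pattern, proof, re.IGNORECASE))
               for pattern in TECHNIQUE_CUES[technique])
    return hits >= min_hits

proof = ("We induct on n. The base case n = 1 holds by inspection, and the "
         "inductive step applies the hypothesis to n - 1.")
print(follows_technique(proof, "induction"))    # True
print(follows_technique(proof, "pigeonhole"))   # False
```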
Circularity Check
Empirical benchmark introduction with no derivation chain or self-referential reductions
full rationale
The paper presents ProofRank as an empirical benchmark that defines five independent proxies for proof quality (conciseness, computational ease, cognitive simplicity, diversity, adaptivity) and applies them to LLM outputs on competition problems. No mathematical derivation, first-principles prediction, or fitted parameter is claimed; the work consists of metric definitions followed by observational comparisons across models. No equations reduce to inputs by construction, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results occurs. The central claims rest on the empirical measurements themselves rather than any closed loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- Proxy measurement rules
axioms (1)
- Domain assumption: mathematical proofs have measurable qualities beyond correctness, such as conciseness and cognitive simplicity, that are concrete and broadly valued.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "We identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness... (v) adaptivity..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We use pairwise comparisons... Bradley-Terry ranking... correctness as a prerequisite"