Recognition: 2 theorem links · Lean Theorem
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
Pith reviewed 2026-05-12 04:43 UTC · model grok-4.3
The pith
LLM-generated proofs differ substantially in quality beyond mere correctness, with measurable trade-offs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that correctness alone fails to capture important differences in LLM proof quality, as shown by substantial model-to-model variation on the ProofRank benchmark's five proxies (conciseness, computational ease, cognitive simplicity, diversity, and adaptivity) and by systematic trade-offs between these proxies and correctness.
What carries the argument
The ProofRank benchmark and its five scalable proxies, each of which scores an aspect of proof quality independently of correctness.
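A minimal sketch, not from the paper, of what scoring quality separately from correctness could look like in data terms: each generated proof carries a correctness flag plus one score per proxy, and model-level profiles are averaged without folding correctness into the quality numbers. The record fields, scales, and plain averaging are illustrative assumptions.

```python
# Illustrative sketch only: field names, scales, and the plain averaging are
# assumptions, not ProofRank's actual data model.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

@dataclass
class ProofRecord:
    model: str
    problem_id: str
    correct: bool               # correctness judged separately from quality
    conciseness: float          # each proxy on its own scale, e.g. 0..1
    computational_ease: float
    cognitive_simplicity: float
    diversity: float
    adaptivity: float

PROXIES = ("conciseness", "computational_ease", "cognitive_simplicity",
           "diversity", "adaptivity")

def model_profile(records: List[ProofRecord], model: str) -> Dict[str, float]:
    """Average each proxy for one model; correctness is reported alongside,
    never mixed into the quality scores."""
    own = [r for r in records if r.model == model]
    profile = {p: mean(getattr(r, p) for r in own) for p in PROXIES}
    profile["correct_rate"] = mean(1.0 if r.correct else 0.0 for r in own)
    return profile
```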
If this is right
- Correctness-only benchmarks miss substantial differences in proof quality across models.
- Significant trade-offs exist between the quality proxies and correctness rates.
- Future evaluations of mathematical reasoning in LLMs should measure how useful the generated proofs are.
Where Pith is reading between the lines
- These proxies could be turned into training signals to push models toward more practical proofs.
- The same approach might apply to evaluating LLM outputs in formal verification or code generation.
- Human preference studies could test whether the automatic scores align with what mathematicians actually value.
Load-bearing premise
The five chosen proxies accurately reflect aspects of proof quality that matter to human readers and can be measured objectively at scale.
What would settle it
A large-scale study in which expert mathematicians consistently prefer proofs that the proxies score as low-quality over those scored high-quality would falsify the central claim.
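One concrete way such a study could be scored, sketched under assumed data: compute the rank agreement between a proxy's scores and expert preferences over the same proofs, e.g. Kendall's tau. The proof scores and expert ratings below are hypothetical; the paper reports no such study.

```python
# Hypothetical data and protocol: rank agreement between one proxy's scores
# and expert preferences over the same proofs.
from itertools import combinations
from typing import Sequence

def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """+1 if the proxy orders proofs exactly as the experts do, -1 if it
    inverts their preferences (tied pairs are ignored)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        concordant += s > 0
        discordant += s < 0
    return (concordant - discordant) / (concordant + discordant)

proxy_scores   = [0.9, 0.4, 0.7, 0.2, 0.6]   # hypothetical proxy scores
expert_ratings = [4.5, 2.0, 4.0, 1.5, 3.0]   # hypothetical expert ratings
print(kendall_tau(proxy_scores, expert_ratings))  # 1.0: perfect agreement
```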
Original abstract
Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.
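The abstract defines the proxies only conceptually. As one hedged illustration, the diversity proxy (iv) could be operationalized as one minus the mean pairwise textual similarity of a model's sampled proofs for a single problem; the sketch below uses difflib.SequenceMatcher as a stand-in similarity measure, an assumption rather than the paper's method.

```python
# Stand-in similarity measure; ProofRank does not specify how diversity is scored.
from difflib import SequenceMatcher
from itertools import combinations
from typing import List

def diversity_score(proofs: List[str]) -> float:
    """Higher when a model's proofs for the same problem differ more from
    each other; 0 when fewer than two samples are available."""
    if len(proofs) < 2:
        return 0.0
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(proofs, 2)]
    return 1.0 - sum(sims) / len(sims)

samples = [
    "Induct on n; the base case is immediate and the step uses the hypothesis.",
    "Induct on n, checking n = 1 directly and passing from n to n + 1.",
    "Count both sides of the identity combinatorially and match the counts.",
]
print(round(diversity_score(samples), 3))
```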
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProofRank, a benchmark of challenging competition problems, to evaluate LLM-generated proofs using five automatic proxies for quality beyond binary correctness: conciseness (step count), computational ease (calculation intensity), cognitive simplicity (technique accessibility), diversity (output variation across samples), and adaptivity (adherence to a specified technique). It claims that models exhibit substantial differences in these dimensions not captured by correctness-only metrics and that significant trade-offs exist between proof-quality proxies and correctness rates.
Significance. If the proxies can be shown to be reproducible and to correlate with human judgments of proof usefulness, the work would meaningfully advance LLM evaluation in mathematical reasoning by shifting focus from correctness alone to practical utility, clarity, and transferability. The empirical benchmark approach itself is a constructive contribution.
major comments (3)
- [Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.
- [Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.
- [Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.
minor comments (2)
- [Methods] Clarify whether the proxies are fully automatic or involve any LLM-as-judge components, and provide pseudocode or formulas for each metric.
- [Results tables] Tables comparing models should report effect sizes or p-values alongside raw differences to support the “substantial differences” claim.
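To make the second minor comment concrete, each model-comparison entry could be backed by an effect size and a p-value along the following lines; the per-problem proxy scores, and the choice of Cohen's d with Welch's t-test, are illustrative assumptions rather than the paper's analysis.

```python
# Hypothetical per-problem scores for two models on one proxy; Cohen's d and
# Welch's t-test are one reasonable choice, assuming SciPy is available.
from statistics import mean, stdev
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

model_a = [0.71, 0.65, 0.80, 0.74, 0.69, 0.77]   # e.g. conciseness per problem
model_b = [0.58, 0.62, 0.66, 0.55, 0.60, 0.64]

t_stat, p_value = ttest_ind(model_a, model_b, equal_var=False)  # Welch's test
print(f"d = {cohens_d(model_a, model_b):.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```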
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. The comments highlight important aspects for improving the reproducibility and validity of our benchmark. We respond to each major comment point by point below.
Point-by-point responses
- Referee: [Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.
Authors: We agree that the proxy definitions require more concrete details to support reproducibility. The manuscript describes the proxies conceptually, but we will revise the relevant sections to include specific heuristics (e.g., for conciseness: automated step counting via proof tree parsing with a threshold of fewer than 10 steps; for cognitive simplicity: a scoring system based on technique frequency in competition solutions), thresholds, the models used for any LLM-assisted scoring, and a link to the open-source code implementing these proxies. This will allow readers to assess and reproduce the measurements. revision: yes
- Referee: [Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.
Authors: We recognize the value of validating the proxies against human judgments. Our focus was on developing scalable automatic proxies that can be applied broadly without human intervention. We did not include a full correlation study in this work, which is a limitation. In the revision, we will add statistical tests (e.g., t-tests for model differences) and confidence intervals to the results. We will also include a discussion of proxy validation and commit to conducting a human evaluation study in follow-up work. We believe the current proxies are grounded in established mathematical values but acknowledge the need for empirical validation. revision: partial
- Referee: [Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.
Authors: We agree that additional details on the benchmark construction are essential. The revised manuscript will expand the Benchmark section to describe: the criteria for selecting problems from competitions (focusing on those with known multiple proof approaches and difficulty level), the exact prompting templates used for generating proofs, the sampling strategy (e.g., number of samples per problem, temperature settings for diversity measurement), and the operationalization of adaptivity (including the technique specification in prompts and the method for checking adherence, such as through rule-based matching or secondary LLM classification). These additions will facilitate reproducibility and bias assessment. revision: yes
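As a sketch of the rule-based adherence check mentioned in the response above, the snippet below decides whether a generated proof follows a requested technique by matching technique-specific cue phrases; the cue lists and the hit threshold are hypothetical stand-ins for whatever rules or secondary LLM classifier the authors actually use.

```python
# Hypothetical cue lists and threshold; a stand-in for the authors' rules or
# a secondary LLM classifier.
import re

TECHNIQUE_CUES = {
    "induction": [r"\binduct(ion|ive)?\b", r"\bbase case\b", r"\binductive step\b"],
    "contradiction": [r"\bcontradiction\b", r"\bsuppose (for contradiction|not)\b"],
    "pigeonhole": [r"\bpigeonhole\b"],
}

def follows_technique(proof: str, technique: str, min_hits: int = 1) -> bool:
    """Count how many cue patterns for the requested technique appear."""
    hits = sum(bool(re.search(pattern, proof, re.IGNORECASE))
               for pattern in TECHNIQUE_CUES[technique])
    return hits >= min_hits

proof = ("We induct on n. The base case n = 1 holds by inspection, and the "
         "inductive step applies the hypothesis to n - 1.")
print(follows_technique(proof, "induction"))    # True
print(follows_technique(proof, "pigeonhole"))   # False
```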
Circularity Check
Empirical benchmark introduction with no derivation chain or self-referential reductions
full rationale
The paper presents ProofRank as an empirical benchmark that defines five independent proxies for proof quality (conciseness, computational ease, cognitive simplicity, diversity, adaptivity) and applies them to LLM outputs on competition problems. No mathematical derivation, first-principles prediction, or fitted parameter is claimed; the work consists of metric definitions followed by observational comparisons across models. No equations reduce to inputs by construction, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results occurs. The central claims rest on the empirical measurements themselves rather than any closed loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- Proxy measurement rules
axioms (1)
- Domain assumption: mathematical proofs have measurable qualities beyond correctness, such as conciseness and cognitive simplicity, that are concrete and broadly valued.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "We identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness... (v) adaptivity..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We use pairwise comparisons... Bradley-Terry ranking... correctness as a prerequisite"