pith. machine review for the scientific record.

arxiv: 2604.06802 · v1 · submitted 2026-04-08 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Riemann-Bench: A Benchmark for Moonshot Mathematics

Edwin Chen, Erik Knutsen, Suhaas Garre, Sushant Mehta

Pith reviewed 2026-05-10 17:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords mathematical reasoning · AI benchmarks · research-level mathematics · frontier models · olympiad performance · private evaluation · closed-form solutions

The pith

Frontier AI models score below 10% on a benchmark of 25 research-level math problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a private benchmark of 25 problems authored by mathematics professors and medalists, each requiring weeks of independent work and double-blind expert verification. These problems test deep theoretical insight rather than competition tricks or limited-domain techniques found in olympiads. When run as unconstrained agents with full access to code and search tools, every current frontier model falls below 10% success across repeated trials. The low scores establish a measurable separation between olympiad-style solving and the demands of open mathematical research. Keeping the problems private prevents memorization and focuses evaluation on authentic capability.

Core claim

Riemann-Bench consists of 25 closed-form problems that underwent double-blind verification by independent experts who solved them from scratch. Evaluated as research agents with coding and search tools over 100 runs each, all frontier models achieve success rates below 10%. This result demonstrates that gold-medal olympiad performance does not translate to the sustained theoretical work needed for genuine research mathematics.
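The paper specifies only that an "unbiased statistical estimator" is computed over the 100 runs per problem. Assuming each run yields a binary verifier outcome, the standard choice in this setting is the unbiased pass@k estimator of Chen et al. (arXiv:2107.03374); the sketch below is that estimator, not code from the paper:

```python
# Unbiased pass@k estimator (Chen et al., arXiv:2107.03374).
# Assumption: the paper's "unbiased statistical estimator" is of this form;
# the text does not spell it out.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n runs
    (of which c succeeded) is a success, estimated without bias."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# 100 runs, 50 successes: pass@1 reduces to the raw success rate
print(pass_at_k(100, 50, 1))  # 0.5
```

For k = 1 this is simply the fraction of successful runs; the combinatorial form matters only when reporting pass@k for k > 1 from the same pool of runs.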

What carries the argument

Riemann-Bench, a private set of 25 expert-curated problems with programmatic verifiers for closed-form answers.
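The verifiers themselves are private. Purely as an illustration of what a programmatic check of a closed-form answer could look like, here is a minimal sketch in which answers are modeled as Python callables and equivalence is spot-checked numerically; the real verifiers presumably use symbolic machinery, and every name below is hypothetical:

```python
# Hypothetical sketch of a programmatic closed-form verifier
# (illustrative only; the paper does not publish its verifier code).
import math
import random

def verify_closed_form(candidate, reference, trials=100, tol=1e-9):
    """Spot-check two single-variable closed forms at random points."""
    rng = random.Random(0)  # fixed seed so checks are reproducible
    for _ in range(trials):
        x = rng.uniform(-10.0, 10.0)
        try:
            a, b = candidate(x), reference(x)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # skip points outside a shared domain
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            return False
    return True

# sin^2 + cos^2 == 1 everywhere, so the sketch accepts this pair
print(verify_closed_form(lambda x: math.sin(x)**2 + math.cos(x)**2,
                         lambda x: 1.0))  # True
```

Numeric spot-checking accepts algebraically different but equivalent forms, which is the property a closed-form verifier needs; a symbolic check would additionally certify equivalence rather than sample it.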

If this is right

  • Olympiad success does not predict performance on research-level problems.
  • New training or reasoning methods will be required to reach research capability.
  • Private benchmarks can isolate true capability from training-data effects.
  • Evaluations must move beyond competition formats to track progress toward open-ended mathematics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to track incremental gains as models improve on individual problems over time.
  • Similar private expert-curated sets might expose comparable gaps in other technical fields such as theoretical physics or algorithm design.
  • If models remain low-scoring, it points toward the need for architectures that support longer-horizon discovery rather than single-shot solutions.

Load-bearing premise

The 25 problems genuinely demand original research insight and cannot be solved by current AI methods through pattern matching or leaked training data.

What would settle it

An independent run where any frontier model reaches above 30% success rate on the full set of problems while respecting the unconstrained-agent protocol.

Figures

Figures reproduced from arXiv: 2604.06802 by Edwin Chen, Erik Knutsen, Suhaas Garre, Sushant Mehta.

Figure 1
Figure 1. pass@1 across frontier models on RIEMANN-BENCH. All models score below 10%, even with full access to coding tools and search, confirming that research-level mathematics remains substantially beyond current capabilities; this is in sharp contrast to olympiad-level benchmarks. view at source ↗
read the original abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Riemann-Bench, a private benchmark of 25 expert-curated problems in research-level mathematics authored by professors, graduate students, and IMO medalists, each verified double-blind by independent experts and equipped with unique closed-form solutions assessed via programmatic verifiers. It evaluates frontier AI models as unconstrained research agents with full access to coding tools, search, and open-ended reasoning, applying an unbiased statistical estimator over 100 independent runs per problem, and reports that all models score below 10%, claiming this exposes a substantial gap between olympiad-level performance and genuine research-level mathematical reasoning.

Significance. If the problems genuinely demand novel research-level insight beyond olympiad techniques and the evaluation protocol fairly captures this without leakage or tool-assisted shortcuts, the results would provide valuable evidence of current limitations in AI mathematical reasoning. The paper merits credit for its rigorous elements: double-blind expert verification requiring solvers to start from scratch, programmatic closed-form verifiers, and the 100-run statistical estimator to reduce variance in agent performance.

major comments (3)
  1. [Abstract] Abstract and evaluation protocol description: The headline result that all frontier models score below 10% is presented as an aggregate without any per-problem success rates, failure-mode analysis, or explicit confirmation that the 100-run estimator was applied uniformly, which is load-bearing for assessing whether the low scores reflect true problem difficulty rather than protocol inconsistencies or model-specific behaviors.
  2. [Benchmark construction] Benchmark construction section: The complete privacy of all 25 problems, while motivated by leakage prevention, provides no public examples, domain descriptions, or high-level characterizations, making it impossible to independently test the central assumption that these problems require research-level insight unsolvable by existing AI techniques or clever use of the allowed tools.
  3. [Results] Results and evaluation sections: No details are given on how the unconstrained-agent protocol (coding/search tools plus verifiers) was implemented consistently or on any statistical checks for run-to-run variability, which directly affects the reliability of the <10% claim and the asserted gap from olympiad performance.
minor comments (2)
  1. [Introduction] The notation for the benchmark name should be introduced and defined explicitly on first use in the introduction rather than appearing only in the abstract.
  2. [Results] A table summarizing model scores with confidence intervals from the 100-run estimator would improve clarity of the results presentation.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment point by point below, indicating where we will revise the manuscript to improve clarity and transparency while preserving the benchmark's core design.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation protocol description: The headline result that all frontier models score below 10% is presented as an aggregate without any per-problem success rates, failure-mode analysis, or explicit confirmation that the 100-run estimator was applied uniformly, which is load-bearing for assessing whether the low scores reflect true problem difficulty rather than protocol inconsistencies or model-specific behaviors.

    Authors: We agree that disaggregating the results would strengthen the paper. In the revised manuscript we will add a table in the Results section reporting per-problem success rates (with the 100-run estimator applied uniformly to each), together with a concise failure-mode summary drawn from the observed agent trajectories. We will also insert an explicit statement in the evaluation protocol subsection confirming uniform application of the estimator across all 25 problems. revision: yes

  2. Referee: [Benchmark construction] Benchmark construction section: The complete privacy of all 25 problems, while motivated by leakage prevention, provides no public examples, domain descriptions, or high-level characterizations, making it impossible to independently test the central assumption that these problems require research-level insight unsolvable by existing AI techniques or clever use of the allowed tools.

    Authors: We acknowledge the referee's desire for greater visibility. Releasing even high-level characterizations or examples would create an unacceptable risk of future leakage, which would invalidate the benchmark as a measure of genuine research capability. The research-level character is instead substantiated by the double-blind verification protocol, in which two independent experts solved each problem from scratch; we will expand the Benchmark construction section with additional quantitative details on verification time, expert credentials, and the requirement that solutions be novel closed-form expressions not reducible to olympiad techniques. revision: partial

  3. Referee: [Results] Results and evaluation sections: No details are given on how the unconstrained-agent protocol (coding/search tools plus verifiers) was implemented consistently or on any statistical checks for run-to-run variability, which directly affects the reliability of the <10% claim and the asserted gap from olympiad performance.

    Authors: We agree that implementation details and variability analysis are needed. The revised Results and Evaluation sections will describe the exact tool-access configuration (coding environment, search API, and verifier interface) provided identically to each model, the prompt template used to enforce the unconstrained-agent setting, and statistical checks including per-problem standard deviation, 95% confidence intervals, and a test for run-to-run independence across the 100 trials. revision: yes
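The rebuttal promises per-problem 95% confidence intervals without naming a method. One conventional choice for a proportion estimated over 100 Bernoulli trials is the Wilson score interval; the sketch below is an assumption about what such a check could look like, not the authors' actual procedure:

```python
# Wilson score 95% confidence interval for a per-problem success rate.
# Illustrative assumption: the rebuttal promises CIs but not the method.
from math import sqrt

def wilson_ci(successes: int, runs: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives 95%)."""
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return center - half, center + half

# 5 successes in 100 runs: the interval is wide at this sample size
lo, hi = wilson_ci(5, 100)
print(f"{lo:.3f} - {hi:.3f}")  # roughly 0.022 - 0.112
```

Note that at 5/100 the upper bound still brushes past 10%, which is exactly why per-problem intervals, rather than a single aggregate figure, would sharpen the paper's headline claim.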

Circularity Check

0 steps flagged

No circularity: empirical benchmark report with no derivations or self-referential claims

full rationale

The paper is a straightforward empirical benchmark introduction and evaluation report. It describes curation of 25 private problems by experts, double-blind verification, and unconstrained agent runs on frontier models yielding <10% scores. No derivation chain exists, no equations or first-principles results are claimed, no parameters are fitted and relabeled as predictions, and no self-citations or ansatzes are invoked to justify any load-bearing step. The central claim rests on the empirical protocol and problem privacy rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, mathematical derivations, or new physical entities; its central claim rests on the domain assumption that the curated problems test research-level mathematics.

axioms (1)
  • domain assumption: The 25 problems represent genuine research-level mathematics that requires deep theoretical knowledge beyond olympiad-style tricks.
    Stated in the abstract as the motivation for creating problems that took authors weeks to solve independently.

pith-pipeline@v0.9.0 · 5526 in / 1329 out tokens · 55466 ms · 2026-05-10T17:58:31.414344+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

    cs.CL 2026-05 unverdicted novelty 8.0

    Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.

  2. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 2 Pith papers · 4 internal anchors

  1. [1] Azerbayev, Z., Piotrowski, B., Schoelkopf, H., et al. ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. arXiv preprint arXiv:2302.12433.

  2. [2] Chen, M., Tworek, J., Jun, H., et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

  3. [3] Chervonyi, Y., et al. Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2. arXiv preprint arXiv:2502.03544.

  4. [4] Cobbe, K., Kosaraju, V., Bavarian, M., et al. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

  5. [5] Ren, X., et al. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition. arXiv preprint arXiv:2504.21801.

  6. [6] Shao, Z., Luo, Y., Lu, C., et al. DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning. arXiv preprint arXiv:2511.22570.

  7. [7] Fang, F., Mei, X., Miao, Z., et al. MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data. arXiv preprint arXiv:2406.18321.

  8. [8] Gao, B., Song, Y., et al. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models. arXiv preprint arXiv:2410.07985.

  9. [9] Glazer, E., Erdil, E., Besiroglu, T., et al. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. arXiv preprint arXiv:2411.04872.

  10. [10] OlymMATH: Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models. arXiv preprint arXiv:2503.21380.

  11. [11] OpenAI. OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.

  12. [12] Pan, M. Z., Arabzadeh, N., Cogo, R., et al. Measuring Agents in Production. arXiv preprint arXiv:2512.04123.

  13. [13] Sawada, T., Paleka, D., Havrilla, A., et al. ARB: Advanced Reasoning Benchmark for Large Language Models. arXiv preprint arXiv:2307.13692.

  14. [14] Song, P., Yang, K., and Anandkumar, A. Lean Copilot: LLMs as Copilots for Theorem Proving in Lean. arXiv preprint arXiv:2404.12534.

  15. [15] Srivastava, M., et al. Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap. arXiv preprint arXiv:2402.19450.

  16. [16] Tsoukalas, G., et al. PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition. arXiv preprint arXiv:2407.11214.

  17. [17] Xu, R., et al. Benchmark Data Contamination of Large Language Models: A Survey. arXiv preprint arXiv:2406.04244.

  18. [18] Zhou, K., Zhu, Y., et al. Don't Make Your LLM an Evaluation Benchmark Cheater. arXiv preprint arXiv:2311.01964.