First Proof Second Batch

Lauren Williams; Mohammed Abouzaid; Nikhil Srivastava; Rachel Ward

arxiv: 2606.18119 · v1 · pith:VPOC3ES3new · submitted 2026-06-16 · 💻 cs.AI

Pith reviewed 2026-06-27 01:05 UTC · model grok-4.3

classification 💻 cs.AI

keywords: AI evaluation, research mathematics, problem solving, benchmark, mathematical reasoning, AI testing, research problems

The pith

AI systems were tested on ten research-level mathematics problems that arose in actual research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates several AI systems by giving them ten problems drawn from a range of mathematical fields, each contributed by working mathematicians because the questions came up in their own research. It presents the problems, describes the testing methodology including prompts and allowed tools, reports the outcomes, and supplies links to the human solutions, AI-generated attempts, and referee logs. A sympathetic reader would care because the exercise supplies a concrete, realistic measure of whether current AI can handle the kind of open-ended, non-routine questions that appear in live mathematical work rather than polished textbook exercises.

Core claim

We assembled ten problems contributed by ten groups of mathematicians working in topology, combinatorics, probability, geometry, and related areas; we presented these problems to multiple AI systems under a uniform protocol; and we collected the generated solutions together with expert referee reports that document success or failure on each item.

What carries the argument

The collection of ten contributed research problems together with the fixed testing protocol and subsequent referee evaluation.

If this is right

The reported results supply a public benchmark against which future AI models can be compared on research-style mathematics.
The referee reports identify concrete patterns of success and failure that can guide targeted improvements in AI mathematical reasoning.
The same contributed-problem format offers a template that other fields could adopt for realistic capability testing.
The accompanying human solutions and logs create a reusable dataset for studying how AI outputs compare with expert reasoning on the same items.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Releasing the full set of problems and logs publicly could allow independent groups to run additional models or refine the evaluation criteria.
If the same problems are reused over time, performance trends could track genuine progress in AI mathematical ability rather than benchmark overfitting.
Extending the approach to problems that require extended computation or multi-step proofs might reveal different capability profiles than short-answer items.

Load-bearing premise

The ten problems are representative of the difficulty and structure of open research mathematics and the testing protocol does not systematically favor or disfavor the AI systems.

What would settle it

A repeat of the experiment on a fresh collection of ten research problems drawn the same way, or with modest changes to prompt wording or allowed tools, that produces substantially different overall success rates would undermine the headline assessment of AI capability.
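
What counts as "substantially different" on batches of only ten problems can be made concrete with an exact test. The sketch below is an editorial illustration, not something taken from the paper: it computes a two-sided Fisher exact p-value for two batches using only the standard library (Python 3.8+ for `math.comb`). The function name and the example counts are hypothetical placeholders, not reported results.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Row 1 = first batch (a solved, b unsolved).
    Row 2 = second batch (c solved, d unsolved).
    """
    row1, row2 = a + b, c + d
    solved_total = a + c
    denom = comb(row1 + row2, solved_total)

    def prob(x: int) -> float:
        # Hypergeometric probability that the first batch holds x of the solved items.
        return comb(row1, x) * comb(row2, solved_total - x) / denom

    p_observed = prob(a)
    lowest = max(0, solved_total - row2)
    highest = min(row1, solved_total)
    # Sum every table that is no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical counts for illustration only: 2 of 10 solved vs. 6 of 10 solved.
    print(fisher_exact_two_sided(2, 8, 6, 4))
```

With these placeholder counts the p-value is about 0.17, so even a gap of 2 versus 6 solved problems would not be distinguishable from chance at the usual 0.05 level. A replication on a single fresh batch of ten would therefore need a large shift in success rate to count as evidence against the headline assessment.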

Figures

Figures reproduced from arXiv: 2606.18119 by Lauren Williams, Mohammed Abouzaid, Nikhil Srivastava, Rachel Ward.

**Figure 1.** The mathematical research cycle. […] to assessing mathematical proofs that AI systems have created. As in our initial experiment, we collected solved but unpublished mathematical problems from human mathematicians, which we posed as a challenge to AI systems. 1.2 Summary of Methodology In keeping with the values of the academic math community, our guiding principle in designing the second batch was transparenc…

**Figure 2.** Overview of the First Proof Second Batch methodology.

**Figure 3.** Problem selection process. Solicitation. We solicited problems from mathematicians representing a wide range of mathematical fields as well as geographic locations (mostly in the United States for logistical reasons). We aimed to obtain mathematical problems which involve a nonstandard insight to answer, and which have a proof known to the mathematician of at most 8 pages which had not appeared on the int…

**Figure 4.** Testing process. […] 2. The system must take as input ten math problems given as .tex source, solve them in one shot (i.e., with no further interaction), and output the results within 24 hours. 3. The system must log input, output, and reasoning tokens. We did not put an upper bound on the number of tokens allowed. 4. First Proof must be allowed to publish all of the output, logs, and code on June 10, 2026.…

**Figure 5.** Grading process. Selection of Referees. We asked the authors of the problems to suggest a list of referees for each problem, that is, mathematicians who had expertise in the field and were qualified to grade the solutions. After several rounds of invitations, we were able to confirm 3 referees for each problem. The total number of confirmed referees across all problems was 30. Anonymization and Referee Ass…
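
The testing requirements quoted in the Figure 4 excerpt (ten problems, one shot, results within 24 hours, logged input, output, and reasoning tokens) lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: `RunRecord`, its field names, and `protocol_violations` are hypothetical and do not describe the authors' actual tooling or log format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class RunRecord:
    """Hypothetical summary of one system's run; every field name is illustrative."""
    num_problems: int
    follow_up_interactions: int
    started: datetime
    finished: datetime
    logged_input_tokens: bool
    logged_output_tokens: bool
    logged_reasoning_tokens: bool


def protocol_violations(run: RunRecord) -> List[str]:
    """Return the quoted requirements that this run fails to meet."""
    violations = []
    if run.num_problems != 10:
        violations.append("expected ten problems as input")
    if run.follow_up_interactions != 0:
        violations.append("run was not one-shot")
    if run.finished - run.started > timedelta(hours=24):
        violations.append("results took longer than 24 hours")
    if not (
        run.logged_input_tokens
        and run.logged_output_tokens
        and run.logged_reasoning_tokens
    ):
        violations.append("input, output, and reasoning tokens must all be logged")
    return violations
```

An empty list means the run meets the requirements that are visible in the truncated excerpt. The excerpt starts at item 2, so any requirement stated in item 1 is not covered here.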

Original abstract

To assess the ability of current AI systems to correctly solve research-level mathematics problems, we tested several AI systems on a set of ten problems in a broad range of mathematical fields; these problems arose naturally in the research process of the contributors. This document includes the problems, our methodology, and the results of our testing. We provide links to supplementary documents including the human solutions, the AI-generated solutions, and the referee reports and logs for the AI-generated solutions. The ten problems were contributed by the following mathematicians: (1) Dariusz Kalociński and Theodore A. Slaman, (2) Richard Schwartz, (3) Aleksa Milojevic and Benny Sudakov, (4) Larry Guth, (5) Oleg Butkovsky, Jonathan Mattingly, and Lorenzo Zambotti, (6) Joshua Evan Greene and Duncan McCoy, (7) Sucharit Sarkar, (8) Sam Payne and Jidong (Jayden) Wang, (9) Sylvie Corteel and John Lentfer, (10) Srivatsav Kunnawalkam Elayavalli.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper reports ten new research math problems with AI attempts and referee reviews, but the headline value depends on unverified representativeness and protocol neutrality.

The main takeaway is a set of ten fresh research problems contributed by active mathematicians, paired with AI solution attempts and external referee evaluations. The paper supplies the problems themselves, human solutions, AI outputs, and referee logs, which is the concrete new data.

It does the basics right by making everything inspectable rather than just claiming aggregate performance. The problems come from real research contexts across fields like topology, combinatorics, and probability, and the authors avoid overclaiming by framing it as a specific test batch. That level of documentation is better than many AI benchmark efforts.

The soft spots are exactly where the stress-test note flags them. There is no quantitative comparison showing these problems match the difficulty or distribution of typical open questions from arXiv or MathOverflow, so the observed AI results stay tied to this curated collection. On the protocol side, the paper needs to demonstrate that prompts, tool access, attempt limits, and correctness criteria were fixed in advance and applied uniformly; without that audit, any edge in one system could reflect setup choices rather than capability. The abstract promises methodology details, but if the full text does not include those checks, generalization stays limited.

This is for readers tracking concrete AI performance on frontier math rather than theoretical frameworks. Someone working on math reasoning benchmarks will get usable examples to discuss. It deserves peer review because the data is new, the setup is transparent enough to evaluate, and the main concerns are addressable with added documentation on selection and protocol rather than a fundamental flaw in the reported outcomes.

Referee Report

2 major / 1 minor

Summary. The manuscript reports results from testing multiple AI systems on a set of ten research-level mathematics problems contributed by working mathematicians across diverse fields. The problems are presented as having arisen naturally in the contributors' research; the paper describes the testing methodology, provides the problems and AI outputs, and links to human solutions plus referee reports and logs.

Significance. If the ten problems are representative in difficulty and structure of typical open research questions and if the evaluation protocol was applied uniformly without post-hoc adjustments, the study would supply useful empirical data on current AI performance on genuine research mathematics. The provision of raw AI outputs and referee logs is a positive transparency feature.

major comments (2)

[Methodology] Methodology section: the claim that the ten problems 'arose naturally in the research process of the contributors' and therefore constitute a meaningful test of research-level mathematics is not supported by any quantitative comparison of problem difficulty, field distribution, or structural features against a reference corpus (e.g., recent arXiv math submissions or MathOverflow questions). Without such a comparison, the observed success rates cannot be generalized beyond this specific curated set.
[Methodology] Methodology section: the description of the testing protocol (prompt templates, number of attempts allowed, tool access, and correctness criteria) does not include an explicit audit or statement confirming that the protocol was applied identically to all systems and without post-selection adjustments. This is load-bearing for any comparative performance claims.

minor comments (1)

[Abstract] The abstract lists ten contributors but does not state how many distinct AI systems were evaluated or the exact success/failure counts; adding a concise summary table in the main text would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below.

Referee: [Methodology] Methodology section: the claim that the ten problems 'arose naturally in the research process of the contributors' and therefore constitute a meaningful test of research-level mathematics is not supported by any quantitative comparison of problem difficulty, field distribution, or structural features against a reference corpus (e.g., recent arXiv math submissions or MathOverflow questions). Without such a comparison, the observed success rates cannot be generalized beyond this specific curated set.

Authors: The manuscript does not contain a quantitative comparison against a reference corpus such as recent arXiv submissions. The problems were obtained by inviting active researchers to contribute questions that had arisen in their own ongoing work, with the intent of capturing authentic research-level mathematics rather than engineered benchmarks. The paper does not claim statistical representativeness or generalizability of the observed rates to the broader space of open problems; the results are presented as empirical observations on this specific collection. We will revise the methodology section to state explicitly that the selection prioritizes natural occurrence and field diversity over statistical sampling, and that no generalization beyond the tested instances is asserted. revision: yes
Referee: [Methodology] Methodology section: the description of the testing protocol (prompt templates, number of attempts allowed, tool access, and correctness criteria) does not include an explicit audit or statement confirming that the protocol was applied identically to all systems and without post-selection adjustments. This is load-bearing for any comparative performance claims.

Authors: We agree that an explicit confirmation strengthens the manuscript. The protocol (including prompt templates, attempt limits, tool access, and referee-determined correctness criteria) was defined in advance and applied uniformly to every system, with all interactions recorded in the provided logs and no post-hoc changes to testing conditions. We will add a concise statement in the methodology section affirming uniform application across systems and directing readers to the supplementary logs for verification. revision: yes

Circularity Check

0 steps flagged

Empirical report of AI testing on math problems shows no circularity

The paper is an empirical report documenting test outcomes of AI systems on ten contributed research-level math problems. No derivation chain, equations, fitted parameters, predictions, or ansatzes are present. The central claim rests on the representativeness of the problem set and the neutrality of the protocol, both of which are stated as assumptions, with no self-referential reduction and no load-bearing self-citation. This matches the default case of a self-contained empirical study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study with no free parameters, mathematical axioms, or invented entities in the central claim.

pith-pipeline@v0.9.1-grok · 5720 in / 958 out tokens · 36726 ms · 2026-06-27T01:05:07.796897+00:00 · methodology
