pith. sign in

arxiv: 2603.20405 · v2 · pith:2ADLH2XLnew · submitted 2026-03-20 · 💻 cs.LG · cs.CL· cs.LO

Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP

Pith reviewed 2026-05-22 10:52 UTC · model grok-4.3

classification 💻 cs.LG cs.CLcs.LO
keywords Putnam competitionRocq proof assistantAI theorem provingformal mathematicslarge language modelsautonomous agentsmathematical proofs
0
0 comments X

The pith

Claude Opus 4.6 with custom Rocq tools proved 10 of 12 Putnam 2025 problems autonomously.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes an experiment showing that a current large language model can generate formal proofs for most problems from a recent Putnam competition. The model used a dedicated set of tools for the Rocq proof assistant that follow a strategy of attempting compilation first and then falling back to interactive corrections. These tools were built by examining logs from an earlier benchmark run on miniF2F. The full run happened on an isolated machine with no external access, spawning many sub-processes and using substantial token volume over hours of wall-clock time. The resulting proofs are released publicly for inspection.

Core claim

Claude Opus 4.6, equipped with a suite of Model Context Protocol tools for the Rocq proof assistant that encode a compile-first interactive-fallback strategy, autonomously proved 10 of the 12 problems from the 2025 Putnam Mathematical Competition. The entire process ran on an isolated virtual machine, deployed 141 subagents over 17.7 hours of active compute, and produced publicly available formal proofs without ongoing human guidance during execution.

What carries the argument

The suite of Model Context Protocol (MCP) tools for Rocq that implement a compile-first, interactive-fallback strategy derived from logs of a prior miniF2F-Rocq experiment.

If this is right

  • Formal proofs of contest mathematics can be produced by an AI agent that interacts with a proof assistant through targeted tools.
  • The agent can sustain multi-hour autonomous operation while spawning sub-processes and managing token budgets in an offline environment.
  • A compile-first strategy reduces unproductive exploration by prioritizing machine-checkable steps before interactive adjustments.
  • The generated proofs become a reusable public resource for studying both the solutions and the agent's reasoning traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tooling pattern could be tested on earlier Putnam years or on other formal systems to measure consistency across problem sets.
  • Extending the tools to handle problems that require more advanced tactics or external libraries might reveal where current limits lie.
  • Combining this approach with other proof assistants could lower the barrier to creating machine-verified versions of competition problems.

Load-bearing premise

The MCP tools supply enough structure for the model to complete Putnam-level problems with little or no further human direction once the run begins.

What would settle it

Re-running the identical setup on the same 12 problems and obtaining fewer than 10 successful formal proofs would show that the reported success rate depended on factors outside the described tools and strategy.

read the original abstract

We report on an experiment in which Claude Opus~4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a "compile-first, interactive-fallback" strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript reports an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools were designed jointly with Claude by analyzing logs from a prior miniF2F-Rocq experiment and encode a compile-first, interactive-fallback strategy. The run occurred on an isolated VM with no internet access, deploying 141 subagents over 17.7 hours of active compute (51.6 h wall-clock) and consuming approximately 1.9 billion tokens. All proofs are stated to be publicly available.

Significance. If the autonomy and correctness claims hold, the work provides concrete evidence that current frontier LLMs, when paired with purpose-built tooling for a formal proof assistant, can solve a majority of Putnam-level problems. The public release of the proofs is a clear strength that permits independent verification and reuse. This contributes to the growing literature on LLM-assisted formal mathematics by moving beyond synthetic benchmarks to a well-known, high-difficulty contest.

major comments (1)
  1. Abstract and experimental description: The central claim that the system 'autonomously proved' 10 of 12 problems rests on the qualifier 'largely autonomous' with 'little ongoing human input after initial VM setup.' However, the manuscript supplies no quantitative counts or logs distinguishing human-initiated actions (e.g., subagent spawning, prompt adjustments, or failure-recovery decisions) from fully autonomous tool calls. Without such metrics, the degree of autonomy cannot be assessed from the reported 141 subagents and 17.7 h run, directly affecting the strength of the main result.
minor comments (2)
  1. The manuscript does not include any error analysis or discussion of the two unsolved problems, which would help readers understand the current limitations of the approach.
  2. Section describing tool design: more explicit pseudocode or decision tree for the 'compile-first, interactive-fallback' strategy would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comment on quantifying autonomy is well-taken and highlights an area where the manuscript can be strengthened for clarity. We address it below and will revise accordingly.

read point-by-point responses
  1. Referee: Abstract and experimental description: The central claim that the system 'autonomously proved' 10 of 12 problems rests on the qualifier 'largely autonomous' with 'little ongoing human input after initial VM setup.' However, the manuscript supplies no quantitative counts or logs distinguishing human-initiated actions (e.g., subagent spawning, prompt adjustments, or failure-recovery decisions) from fully autonomous tool calls. Without such metrics, the degree of autonomy cannot be assessed from the reported 141 subagents and 17.7 h run, directly affecting the strength of the main result.

    Authors: We agree that the current description would benefit from explicit quantitative metrics to allow readers to assess the degree of autonomy. In the revised manuscript we will add a dedicated subsection under Experimental Setup that reports: (1) total human effort was limited to a one-time 2-hour VM provisioning and tool installation step; (2) after that point, zero human-initiated actions occurred; (3) all 141 subagent spawns, all prompt adjustments, and all failure-recovery decisions were executed exclusively by the primary Claude Opus 4.6 agent via MCP tool calls. We will also release the full (redacted) interaction logs together with the proofs so that the autonomy claim can be independently verified. These additions directly respond to the referee's request for quantitative counts and logs. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical performance report

full rationale

The paper is an empirical report measuring an AI agent's success rate on external, independent Putnam 2025 contest problems. No mathematical derivation chain, fitted parameters, or self-referential definitions are present. Tool design references a prior miniF2F-Rocq experiment, but this is not load-bearing for the reported success count, which is validated against contest problems outside the paper's inputs. The evaluation is self-contained against external benchmarks with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical report of an AI experiment rather than a theoretical derivation. No free parameters, mathematical axioms, or postulated entities are introduced.

pith-pipeline@v0.9.0 · 5654 in / 1179 out tokens · 58327 ms · 2026-05-22T10:52:44.259906+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.