Evaluating Research-Level Math Proofs via Strict Step-Level Verification

Yifeng Sun

arxiv: 2606.10799 · v1 · pith:PJCCLG6Bnew · submitted 2026-06-09 · 💻 cs.AI

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

Yifeng Sun This is my paper

Pith reviewed 2026-06-27 13:15 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM proof verificationstep-level evaluationdeductive constraintscontext poisoningmathematical error localizationresearch-level proofsFirstProof challengepedantic hyper-rigor

0 comments

The pith

Strict step-level verification is required to localize subtle logical errors that global LLM prompts miss in research-level math proofs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that large language models cannot reliably verify complex mathematical proofs using standard global evaluation because superficially correct statements hide logical flaws through context poisoning. Shifting to strict step-level verification, where each deduction maintains its own context and theorem sources are tightly constrained, allows the model to catch those errors. Ablation experiments on an adversarial set of research-level proofs show that removing the constraints causes the method to fail at error localization. The remaining failures turn out to be cases of excessive pedantry rather than outright hallucinations, which in turn reveals unstated conventions inside the benchmark proofs themselves. A reader would care because this approach suggests a practical way to make automated verification more trustworthy on proofs that go beyond what the model already knows.

Core claim

The central claim is that deductive constraints applied at each individual proof step are indispensable for accurate verification. On the FirstProof challenge suite, unconstrained global prompting consistently fails to identify subtle logical errors, while the constrained step-level method succeeds at localization. Error analysis then shows that the new failure mode is mainly pedantic hyper-rigor arising from implicit domain conventions rather than severe logical hallucinations, thereby exposing ambiguities already present in the expert-written proofs.

What carries the argument

The step-level verification framework that maintains separate detailed context for each deduction step while strictly constraining the allowed sources of applied theorems.

If this is right

Unconstrained global prompting fails to localize subtle logical errors in research-level proofs.
The constrained step-level method outperforms global evaluation on the same proofs.
Most remaining rejections after step-level verification are instances of pedantic hyper-rigor caused by unstated domain conventions.
The method thereby exposes implicit ambiguities inside the expert benchmark itself.
Organizing verification notes in a cautious human-mathematician style improves an agent's ability to separate rigorous proofs from flawed ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-step constraint technique might improve verification on long chains of reasoning outside pure mathematics.
The shift from hallucination errors to hyper-rigor errors suggests that future benchmarks should explicitly document their implicit conventions.
Agents could be trained to generate the required step notes automatically, turning the method into a scalable review tool.
Testing the framework on proofs that deliberately omit standard conventions would directly measure how much of the improvement comes from handling implicit knowledge.

Load-bearing premise

The FirstProof challenge contains a representative sample of research-level proofs whose subtle errors reliably expose the weaknesses of global verification methods.

What would settle it

Apply the identical ablation study to a fresh collection of research-level proofs and find that unconstrained global prompting localizes subtle logical errors at least as accurately as the constrained step-level method.

read the original abstract

Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To address this, we shift from global evaluation to strict step-level verification: our framework maintains detailed context for each deduction step and strictly constrains the sources of applied theorems. We evaluate on a carefully curated adversarial diagnostic suite of research-level proofs drawn from the FirstProof challenge. A systematic ablation study demonstrates that these deductive constraints are indispensable, as unconstrained global prompting consistently fails to localize subtle logical errors. Beyond outperforming global evaluation, our approach fundamentally alters the failure taxonomy. Error analysis reveals that, rather than exhibiting severe logical hallucinations, remaining rejections are primarily instances of "pedantic hyper-rigor" stemming from unstated domain conventions, effectively exposing implicit ambiguities within the expert benchmark itself. Our findings suggest that prompting agents to organize their verification notes in a cautious, human-mathematician-like manner can substantially improve their ability to distinguish rigorous proofs from flawed ones, with the potential to strengthen agentic reasoning on frontier mathematical concepts that the base model does not already know well, and to lay a theoretical foundation for future automated proof-review systems. Code and prompts are available at GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Step-level verification with theorem constraints changes the error profile on their benchmark but the paper supplies no numbers or benchmark details to support the strong claims.

read the letter

The paper tries to fix LLM struggles with verifying long math proofs by switching from one-shot global checks to strict per-step verification. Each step keeps its own context and theorem uses are limited to explicitly allowed sources. They test this on a set of research-level proofs they call the FirstProof challenge and report that global prompting misses subtle errors while their method does better and mostly surfaces only overly strict rejections about missing conventions.

What is actually new is the combination of maintained step context plus source constraints applied to adversarial research-level material, plus the observation that the leftover failures are pedantic rather than hallucinatory. The error taxonomy shift and the public code are the concrete contributions.

The main weakness is the complete absence of numbers. The abstract mentions an ablation and error analysis but gives no success rates, no proof count, no domain breakdown, and no example of a logical error that global prompting misses. Without those, the claim that the constraints are indispensable rests on an undescribed benchmark whose representativeness is not shown. The stress-test concern about FirstProof being a reliable diagnostic suite holds up on the supplied text.

This is for people working on agentic math reasoning and automated proof review. A reader who wants prompting ideas for catching subtle flaws would find the approach worth trying, but anyone needing evidence that it generalizes would have to wait for the full numbers.

Send it to referees if the manuscript adds the missing quantitative results and benchmark description; the direction is worth checking even if the current write-up is thin on evidence.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard global LLM evaluation of research-level math proofs suffers from context poisoning, where plausible statements mask subtle flaws, and proposes strict step-level verification with deductive constraints on theorem sources to address this. It evaluates the approach on a curated adversarial suite from the FirstProof challenge, reporting via ablation that unconstrained global prompting consistently fails to localize errors while the constrained method succeeds and shifts the failure taxonomy from logical hallucinations to pedantic hyper-rigor arising from unstated domain conventions.

Significance. If the ablation results and error analysis hold with adequate benchmark documentation, the work could strengthen agentic mathematical reasoning by demonstrating the value of human-like structured verification notes and provide a foundation for automated proof-review systems. The public release of code and prompts is a positive contribution that supports reproducibility.

major comments (2)

[Abstract] Abstract: the claim that a 'systematic ablation study demonstrates that these deductive constraints are indispensable' and that 'unconstrained global prompting consistently fails' is not supported by any quantitative results, error rates, dataset sizes, or concrete examples of localized errors in the provided text, preventing assessment of the central claim.
[Abstract] Abstract and Evaluation section: the FirstProof challenge is described only as a 'carefully curated adversarial diagnostic suite' with no selection criteria, proof sources, domains, count of proofs, difficulty metrics, or even one example of a subtle logical error that global methods miss; this leaves the representativeness of the benchmark and the generality of the failure-taxonomy shift unanchored.

minor comments (1)

[Abstract] Abstract: the term 'context poisoning' is introduced without a formal definition or citation to prior literature on LLM verification failures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and benchmark documentation. The comments correctly identify areas where additional quantitative support and details would strengthen the presentation. We address each point below and commit to revisions that incorporate the requested information without altering the core claims or results.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that a 'systematic ablation study demonstrates that these deductive constraints are indispensable' and that 'unconstrained global prompting consistently fails' is not supported by any quantitative results, error rates, dataset sizes, or concrete examples of localized errors in the provided text, preventing assessment of the central claim.

Authors: We agree that the abstract, as a high-level summary, does not include the specific quantitative results, error rates, dataset sizes, or examples needed to fully support the claims on its own. The Evaluation section of the manuscript contains the ablation study with these details. To address the concern directly, we will revise the abstract to include key quantitative metrics (e.g., error localization rates and dataset size) and a brief example of a localized error, enabling standalone assessment of the central claim. revision: yes
Referee: [Abstract] Abstract and Evaluation section: the FirstProof challenge is described only as a 'carefully curated adversarial diagnostic suite' with no selection criteria, proof sources, domains, count of proofs, difficulty metrics, or even one example of a subtle logical error that global methods miss; this leaves the representativeness of the benchmark and the generality of the failure-taxonomy shift unanchored.

Authors: We acknowledge that the current descriptions in the abstract and Evaluation section lack the requested specifics on selection criteria, sources, domains, count, difficulty metrics, and a concrete example. This limits evaluation of benchmark representativeness. We will expand both sections to include these details, along with an example of a subtle logical error missed by global methods, to better anchor the failure-taxonomy analysis and generality of the results. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmark and ablations

full rationale

The paper's central claim—that strict step-level verification outperforms global prompting—rests on ablation studies and evaluation against the external FirstProof challenge benchmark. No self-definitional relations, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems imported from the authors, ansatzes smuggled via citation, or renamings of known results appear in the abstract or described methodology. The derivation chain is self-contained against an independent diagnostic suite and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the validity of the FirstProof benchmark and the assumption that step-level constraints address context poisoning without introducing new biases.

axioms (1)

domain assumption The FirstProof challenge provides a representative set of adversarial research-level proofs suitable for diagnosing LLM verification failures.
Used as the sole evaluation suite in the described experiments.

invented entities (1)

context poisoning no independent evidence
purpose: Describes the mechanism by which global evaluation allows superficially plausible statements to mask logical flaws.
Introduced to explain why global prompting fails; no independent evidence provided beyond the paper's own observations.

pith-pipeline@v0.9.1-grok · 5753 in / 1141 out tokens · 19212 ms · 2026-06-27T13:15:05.691705+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 6 canonical work pages · 5 internal anchors

[1]

Deepseek-R1: Incentivizing rea- soning capability in LLMs via reinforcement learning,

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li,et al., “Deepseek-R1: Incentiv...

2025
[2]

Aletheia tackles firstproof au- tonomously,

T. Feng, J. Jung, S.-h. Kim, C. Pagano, S. Gukov, C.-C. Tsai, D. Woodruff, A. Javanmard, A. Mokhtari, D. Hwang,et al., “Aletheia tackles firstproof au- tonomously,”arXiv preprint arXiv:2602.21201, 2026

work page arXiv 2026
[3]

Automated Conjecture Resolution with Formal Verification

H. Ju, G. Gao, J. Jiang, B. Wu, Z. Sun, L. Chen, Y . Wang, Y . Wang, Z. Wang, W. He,et al., “Automated conjec- ture resolution with formal verification,”arXiv preprint arXiv:2604.03789, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

The lean 4 theorem prover and programming language,

L. de Moura and S. Ullrich, “The lean 4 theorem prover and programming language,” inAutomated Deduction– CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings 28, pp. 625–635, Springer, 2021

2021
[5]

The Lean mathematical li- brary,

The mathlib Community, “The Lean mathematical li- brary,” 2019

2019
[6]

Autoformalization with large lan- guage models,

Y . Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jam- nik, and C. Szegedy, “Autoformalization with large lan- guage models,” 2022

2022
[7]

Draft, sketch, and prove: Guiding formal theorem provers with infor- mal proofs,

A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jam- nik, T. Lacroix, Y . Wu, and G. Lample, “Draft, sketch, and prove: Guiding formal theorem provers with infor- mal proofs,” 2022

2022
[8]

Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,

Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad, “Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,” 2023

2023
[9]

Le- andojo: Theorem proving with retrieval-augmented lan- guage models,

K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. Prenger, and A. Anandkumar, “Le- andojo: Theorem proving with retrieval-augmented lan- guage models,” 2023

2023
[10]

Minif2f: a cross- system benchmark for formal olympiad-level mathemat- ics,

K. Zheng, J. M. Han, and S. Polu, “Minif2f: a cross- system benchmark for formal olympiad-level mathemat- ics,” 2021

2021
[11]

Let's Verify Step by Step

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,”arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Training verifiers to solve math word problems,

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, 7 C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021

2021
[13]

First proof,

M. Abouzaid, A. J. Blumberg, M. Hairer, J. Kileel, T. G. Kolda, P. D. Nelson, D. Spielman, N. Srivastava, R. Ward, S. Weinberger, and L. Williams, “First proof,” 2026

2026
[14]

First batch | first proof project,

First Proof Project, “First batch | first proof project,”
[15]

Project page for the first batch of First Proof problems
[16]

First proof solutions and comments,

First Proof Project, “First proof solutions and comments,”
[17]

Official solutions and commentary PDF for the First Proof project
[18]

Weinberger,Variations on a Theme of Borel: An Es- say on the Role of the Fundamental Group in Rigidity

S. Weinberger,Variations on a Theme of Borel: An Es- say on the Role of the Fundamental Group in Rigidity. Cambridge University Press, 2022

2022
[19]

D. W. Morris,Introduction to arithmetic groups, vol. 2. Deductive Press Lieu de publication inconnu, 2015

2015
[20]

Planar point sets with many unit distances,

OpenAI, “Planar point sets with many unit distances,”
[21]

https://cdn.openai.com/pdf/74c24085-19b0-4534- 9c90-465b8e29ad73/unit-distance-proof.pdf
[22]

Optimal bend-and-break for foliations

J. Liu, Z. Sun, and J. Jiang, “Optimal bounds in bend-and- break for foliations,”arXiv preprint arXiv:2605.20754, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Shokurov's global index conjecture for threefold foliations

J. Liu and S. Qin, “Shokurov’s global index conjecture for threefold foliations,”arXiv preprint arXiv:2605.22735, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Boundedness of total Cartier indices for rational singularities in families

J. Liu, R. Hu, and S. Qin, “Boundedness of total cartier indices for rational singularities in families,”arXiv preprint arXiv:2605.22782, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

The mathematical vernacular, a language for mathematics with typed sets,

N. de Bruijn, “The mathematical vernacular, a language for mathematics with typed sets,” inSelected Papers on Automath(R. Nederpelt, J. Geuvers, and R. de Vrijer, eds.), vol. 133 ofStudies in Logic and the Foundations of Mathematics, pp. 865–935, Elsevier, 1994

1994
[26]

The language of mathematics,

M. Ganesalingam, “The language of mathematics,” in The Language of Mathematics: A Linguistic and Philo- sophical Investigation, pp. 17–38, Springer, 2013

2013
[27]

Research report: some con- structions in polyhedral symplectic topology suggested by ai,

A. Alfieri and C. Novak, “Research report: some con- structions in polyhedral symplectic topology suggested by ai,” 2026

2026
[28]

Finiteness properties for some rational Poincaré duality groups,

J. Fowler, “Finiteness properties for some rational Poincaré duality groups,”Illinois Journal of Mathemat- ics, vol. 56, no. 2, pp. 281–299, 2012. A Prompts for the Verification Agent In this appendix, we present the prompts used to drive our state-based verification agent. The prompts are administered to the Codex agent as fixed instruction prompts. A.1 ...

2012
[29]

Closing D6b1a.The agent verifies step_15b1a_signature_class_to_assembly. The successful route here is to replace an indirect complex-K- theoretic justification with a more exact real KO-theoretic source: Rosenberg and Weinberger’s Lipschitz-signature theorem is used to construct the real KO-homology signature class and to identify its assembly image with ...
[30]

found the proof

Closing D6b1b.The agent closes this gap by rewriting step_15b1b_signature_class_to_lclass. Here the successful move is not simply to cite a missing theorem verbatim. Instead, the agent builds a local bridge: (1) identify the real signature class inKO-homology; (2) complexify it to the standard complex signature class; (3) invoke the known complex K-homolo...

[1] [1]

Deepseek-R1: Incentivizing rea- soning capability in LLMs via reinforcement learning,

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li,et al., “Deepseek-R1: Incentiv...

2025

[2] [2]

Aletheia tackles firstproof au- tonomously,

T. Feng, J. Jung, S.-h. Kim, C. Pagano, S. Gukov, C.-C. Tsai, D. Woodruff, A. Javanmard, A. Mokhtari, D. Hwang,et al., “Aletheia tackles firstproof au- tonomously,”arXiv preprint arXiv:2602.21201, 2026

work page arXiv 2026

[3] [3]

Automated Conjecture Resolution with Formal Verification

H. Ju, G. Gao, J. Jiang, B. Wu, Z. Sun, L. Chen, Y . Wang, Y . Wang, Z. Wang, W. He,et al., “Automated conjec- ture resolution with formal verification,”arXiv preprint arXiv:2604.03789, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

The lean 4 theorem prover and programming language,

L. de Moura and S. Ullrich, “The lean 4 theorem prover and programming language,” inAutomated Deduction– CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings 28, pp. 625–635, Springer, 2021

2021

[5] [5]

The Lean mathematical li- brary,

The mathlib Community, “The Lean mathematical li- brary,” 2019

2019

[6] [6]

Autoformalization with large lan- guage models,

Y . Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jam- nik, and C. Szegedy, “Autoformalization with large lan- guage models,” 2022

2022

[7] [7]

Draft, sketch, and prove: Guiding formal theorem provers with infor- mal proofs,

A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jam- nik, T. Lacroix, Y . Wu, and G. Lample, “Draft, sketch, and prove: Guiding formal theorem provers with infor- mal proofs,” 2022

2022

[8] [8]

Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,

Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad, “Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,” 2023

2023

[9] [9]

Le- andojo: Theorem proving with retrieval-augmented lan- guage models,

K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. Prenger, and A. Anandkumar, “Le- andojo: Theorem proving with retrieval-augmented lan- guage models,” 2023

2023

[10] [10]

Minif2f: a cross- system benchmark for formal olympiad-level mathemat- ics,

K. Zheng, J. M. Han, and S. Polu, “Minif2f: a cross- system benchmark for formal olympiad-level mathemat- ics,” 2021

2021

[11] [11]

Let's Verify Step by Step

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,”arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Training verifiers to solve math word problems,

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, 7 C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021

2021

[13] [13]

First proof,

M. Abouzaid, A. J. Blumberg, M. Hairer, J. Kileel, T. G. Kolda, P. D. Nelson, D. Spielman, N. Srivastava, R. Ward, S. Weinberger, and L. Williams, “First proof,” 2026

2026

[14] [14]

First batch | first proof project,

First Proof Project, “First batch | first proof project,”

[15] [15]

Project page for the first batch of First Proof problems

[16] [16]

First proof solutions and comments,

First Proof Project, “First proof solutions and comments,”

[17] [17]

Official solutions and commentary PDF for the First Proof project

[18] [18]

Weinberger,Variations on a Theme of Borel: An Es- say on the Role of the Fundamental Group in Rigidity

S. Weinberger,Variations on a Theme of Borel: An Es- say on the Role of the Fundamental Group in Rigidity. Cambridge University Press, 2022

2022

[19] [19]

D. W. Morris,Introduction to arithmetic groups, vol. 2. Deductive Press Lieu de publication inconnu, 2015

2015

[20] [20]

Planar point sets with many unit distances,

OpenAI, “Planar point sets with many unit distances,”

[21] [21]

https://cdn.openai.com/pdf/74c24085-19b0-4534- 9c90-465b8e29ad73/unit-distance-proof.pdf

[22] [22]

Optimal bend-and-break for foliations

J. Liu, Z. Sun, and J. Jiang, “Optimal bounds in bend-and- break for foliations,”arXiv preprint arXiv:2605.20754, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Shokurov's global index conjecture for threefold foliations

J. Liu and S. Qin, “Shokurov’s global index conjecture for threefold foliations,”arXiv preprint arXiv:2605.22735, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[24] [24]

Boundedness of total Cartier indices for rational singularities in families

J. Liu, R. Hu, and S. Qin, “Boundedness of total cartier indices for rational singularities in families,”arXiv preprint arXiv:2605.22782, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

The mathematical vernacular, a language for mathematics with typed sets,

N. de Bruijn, “The mathematical vernacular, a language for mathematics with typed sets,” inSelected Papers on Automath(R. Nederpelt, J. Geuvers, and R. de Vrijer, eds.), vol. 133 ofStudies in Logic and the Foundations of Mathematics, pp. 865–935, Elsevier, 1994

1994

[26] [26]

The language of mathematics,

M. Ganesalingam, “The language of mathematics,” in The Language of Mathematics: A Linguistic and Philo- sophical Investigation, pp. 17–38, Springer, 2013

2013

[27] [27]

Research report: some con- structions in polyhedral symplectic topology suggested by ai,

A. Alfieri and C. Novak, “Research report: some con- structions in polyhedral symplectic topology suggested by ai,” 2026

2026

[28] [28]

Finiteness properties for some rational Poincaré duality groups,

J. Fowler, “Finiteness properties for some rational Poincaré duality groups,”Illinois Journal of Mathemat- ics, vol. 56, no. 2, pp. 281–299, 2012. A Prompts for the Verification Agent In this appendix, we present the prompts used to drive our state-based verification agent. The prompts are administered to the Codex agent as fixed instruction prompts. A.1 ...

2012

[29] [29]

Closing D6b1a.The agent verifies step_15b1a_signature_class_to_assembly. The successful route here is to replace an indirect complex-K- theoretic justification with a more exact real KO-theoretic source: Rosenberg and Weinberger’s Lipschitz-signature theorem is used to construct the real KO-homology signature class and to identify its assembly image with ...

[30] [30]

found the proof

Closing D6b1b.The agent closes this gap by rewriting step_15b1b_signature_class_to_lclass. Here the successful move is not simply to cite a missing theorem verbatim. Instead, the agent builds a local bridge: (1) identify the real signature class inKO-homology; (2) complexify it to the standard complex signature class; (3) invoke the known complex K-homolo...