Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation

Arnesh Banerjee; Ayushi Bhattacharjee

arxiv: 2606.24902 · v1 · pith:SV4WQ2FPnew · submitted 2026-06-12 · 💻 cs.DL · cs.AI

Pith reviewed 2026-06-27 05:07 UTC · model grok-4.3

keywords: large language models, mathematical reasoning, hallucination, premise smuggling, failure mode taxonomy, research-level mathematics

The pith

LLMs on research math questions smuggle unproven premises in every audited proof, with no fabricated citations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper starts from the observation that LLMs produce confidently wrong proofs on research-level math questions and asks what specific mechanisms cause the errors. It extracts four failure modes from existing post-mortems and then applies two new instruments to eight one-shot proofs generated by Gemini 2.5 Flash. The audit shows that citation fabrication did not occur in any proof, yet premise smuggling (asserting a load-bearing claim as a standard result with no justification) appeared in every proof. A premise-audit instrument detects the smuggling at 100 percent precision on the judged flags. The work concludes that detection after generation is insufficient and that inference-time pipelines must be designed to block the modes before they appear.

Core claim

From the First Proof benchmark post-mortems the paper defines four failure modes: citation fabrication (F1), premise smuggling (F2), silent problem reformulation (F3), and local-to-global compatibility gaps (F4). An empirical check of eight Gemini 2.5 Flash proofs on benchmark questions 1, 2, and 5 found zero confirmed fabricated citations but at least one instance of F2 in each proof. The premise-audit instrument introduced for the study flags these unjustified load-bearing claims at 100 percent precision and 50 percent proof-level recall in the corpus.
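The reported figures reduce to simple arithmetic, which the sketch below spells out. It is illustrative only: the `FailureMode` enum and the helper names are hypothetical, and the count of four flagged proofs is inferred here from the stated 50% proof-level recall over eight proofs that all contain F2; it is not a number quoted from the paper.

```python
from enum import Enum


class FailureMode(Enum):
    """The four failure modes named in the taxonomy."""
    F1 = "citation fabrication"
    F2 = "premise smuggling"
    F3 = "silent problem reformulation"
    F4 = "local-to-global compatibility gap"


def precision(true_positives: int, judged_flags: int) -> float:
    """Share of judged flags that the judges confirmed."""
    return true_positives / judged_flags


def proof_level_recall(proofs_flagged: int, proofs_with_mode: int) -> float:
    """Share of affected proofs in which at least one flag was confirmed."""
    return proofs_flagged / proofs_with_mode


# Stated in the abstract: 5 of 5 judge-confirmed flags are true positives.
f2_precision = precision(5, 5)

# Assumption: 50% recall over 8 affected proofs implies 4 proofs flagged.
f2_recall = proof_level_recall(4, 8)

print(FailureMode.F2.value, f2_precision, f2_recall)
```

Note that proof-level recall counts proofs, not individual smuggled premises, so it says nothing about how many F2 instances inside a flagged proof were caught.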

What carries the argument

The premise-audit instrument, which surfaces load-bearing claims presented as fundamental results or standard arguments without attached justification.
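This summary does not specify how the instrument works internally, so the following is only a rough keyword-based sketch of the idea and not the authors' implementation. The trigger phrases beyond the two quoted in the abstract, the pattern used to recognise attached justification, and the function name `flag_smuggled_premises` are all assumptions made for illustration.

```python
import re

# Assumed trigger phrases; the first two are quoted in the abstract,
# the others are hypothetical additions.
TRIGGER = re.compile(
    r"fundamental result|standard argument|well[- ]known|it is clear that",
    re.IGNORECASE,
)

# Assumed markers of attached justification: a bracketed citation
# or an explicit pointer to a numbered lemma, theorem, or proposition.
JUSTIFIED = re.compile(
    r"\[\d+\]|\b(?:Lemma|Theorem|Proposition)\s+\d+",
    re.IGNORECASE,
)


def flag_smuggled_premises(proof: str) -> list[str]:
    """Return sentences that invoke a 'standard' claim with no support."""
    sentences = re.split(r"(?<=[.!?])\s+", proof.strip())
    return [
        s for s in sentences
        if TRIGGER.search(s) and not JUSTIFIED.search(s)
    ]


example = (
    "By a standard argument, the operator is compact. "
    "By Theorem 2, the spectrum is discrete. "
    "It is a fundamental result [3] that the limit exists."
)
flags = flag_smuggled_premises(example)
print(flags)
```

Only the first sentence of the example is flagged. It contains no reference at all, which is why a citation checker would have nothing to verify there. A flag from a heuristic like this marks a candidate only; deciding whether the claim is load-bearing still needs a judge, as in the paper's audit.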

If this is right

Citation-based verification tools miss the dominant observed failure mode because premise smuggling requires no false reference.
Retrieval-augmented generation alone cannot eliminate the errors identified here.
Inference-time pipelines that block premise smuggling before output are required to address the failure modes at source.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could be applied to design training objectives that penalize unjustified assertions during generation.
Similar audits on other math benchmarks might show whether premise smuggling is common beyond this small corpus.
Addressing the modes may require changes to prompting or model architecture rather than post-hoc checks.

Load-bearing premise

The post-mortems in the First Proof benchmark Appendix A correctly identify the errors, and the human judges used to confirm premise-smuggling flags are accurate and unbiased.

What would settle it

Independent re-examination of the same eight proofs in which judges determine that one or more of the flagged premises are in fact standard results with readily available justifications.

Original abstract

The "First Proof" benchmark [1] posed ten research-level mathematics questions to the strongest publicly available LLMs and found them consistently wrong: not silent, but confidently, fluently wrong. This paper asks why. Working from the per-question post-mortems in First Proof's Appendix A, I identify four failure modes: citation fabrication (F1), premise smuggling (F2), silent problem reformulation (F3), and local-to-global compatibility gaps (F4). I then audit eight one-shot proofs generated by Gemini 2.5 Flash on Questions 1, 2, and 5 of the benchmark, using two instruments built specifically to surface F1 and F2. The central finding is uncomfortable for anyone who sees retrieval-augmented generation (RAG) as the obvious fix: not one of the eight proofs contained a confirmed fabricated citation, yet every single one contained at least one load-bearing claim asserted as a "fundamental result" or "standard argument" with no justification attached. That failure mode (F2, premise smuggling) is invisible to citation verification by design. A premise-audit instrument I introduce flags it at 100% precision (5/5 judge-confirmed flags are true positives) and 50% proof-level recall in this corpus. The taxonomy and the audit together suggest that the right long-term objective is building inference-time pipelines that prevent these failure modes from occurring, not just detecting them after the fact.

Index Terms: Large language models, mathematical reasoning, hallucination, premise smuggling, failure-mode taxonomy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's audit shows premise smuggling in all eight proofs with no fabricated citations, backed by a new four-mode taxonomy and audit instrument, though the small sample and reliance on external post-mortems keep the claims provisional.

The main thing to know is that this audit of eight Gemini proofs on three First Proof questions found zero confirmed fabricated citations but at least one load-bearing unjustified premise in every proof. The authors call this F2, premise smuggling, and argue it is invisible to citation checks by design.

What is new is the four-mode taxonomy (F1 citation fabrication, F2 premise smuggling, F3 silent reformulation, F4 local-to-global gaps) plus the premise-audit instrument they built to surface F2. They report 100% precision on the five judge-confirmed flags and 50% proof-level recall in this set. That split between F1 and F2 is a clean empirical observation and gives people working on math LLMs a sharper way to talk about failure than generic hallucination labels.

The soft spots are straightforward. The sample is tiny (eight proofs across three questions), so the "every proof" result describes this corpus rather than supporting a broad claim. The whole analysis depends on the accuracy of the original benchmark's Appendix A post-mortems for what background each question requires, and on the human judges' consistency when labeling claims as load-bearing and unjustified. The 50% proof-level recall also means the instrument misses cases. These are real limits on how far the numbers can be pushed.
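To put a rough number on how little a sample this small can support, one option is a Wilson score interval on the two headline proportions. This is a sketch added for illustration, not an analysis from the paper; the helper name `wilson_interval` is hypothetical, and the calculation assumes independent trials, which proofs drawn from the same three questions may not satisfy.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


precision_ci = wilson_interval(5, 5)   # 5/5 judge-confirmed flags
prevalence_ci = wilson_interval(8, 8)  # F2 observed in 8/8 proofs
print(precision_ci, prevalence_ci)
```

Under these assumptions the lower bounds come out near 0.57 for the precision and near 0.68 for the share of proofs containing F2, which is consistent with reading both results as descriptive of this corpus.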

This is for people building or evaluating automated math tools who want a structured way to diagnose where the models go wrong. It is worth sending to peer review because the taxonomy and the F2 observation are concrete enough to be checked and extended, even if the current evidence needs more data to carry heavier weight.

Referee Report

2 major / 1 minor

Summary. The paper develops a four-mode taxonomy of LLM failures on research-level mathematics (F1: citation fabrication, F2: premise smuggling, F3: silent reformulation, F4: compatibility gaps) drawn from the First Proof benchmark post-mortems. It then audits eight one-shot Gemini 2.5 Flash proofs on Questions 1, 2 and 5, reports zero confirmed F1 instances yet F2 in every proof, and introduces a premise-audit instrument that flags F2 at 100% precision (5/5) and 50% proof-level recall. The central conclusion is that premise smuggling is invisible to citation-based checks and that inference-time pipelines, rather than post-hoc RAG, are required.

Significance. If the central observation holds, the work is significant because it isolates a failure mode (F2) that standard retrieval and citation verification cannot detect, supplies a concrete empirical demonstration on real model outputs, and offers a reusable audit instrument. The taxonomy supplies a structured vocabulary that future studies of mathematical reasoning can adopt. The absence of self-referential derivations or fitted parameters strengthens the empirical character of the contribution.

major comments (2)

[Abstract and audit results section] Abstract and the section describing the audit results: the claim that F2 occurs in every one of the eight proofs (and that the instrument reaches 100% precision) is load-bearing for the paper's central finding, yet rests entirely on the accuracy of the external First Proof Appendix A post-mortems for Questions 1, 2 and 5 plus the human judges' classification of which asserted statements count as load-bearing 'fundamental results' lacking justification. Any variation in those two steps directly changes both the 'every proof' count and the precision/recall figures.
[Instrument introduction and results] The section introducing the premise-audit instrument: the reported 50% proof-level recall means that the instrument fails to flag F2 in half of the affected proofs even in this small corpus; this directly weakens the claim that the instrument 'surfaces' the failure mode effectively and therefore limits the strength of the recommendation for inference-time pipelines.

minor comments (1)

[Abstract] The abstract states the sample size and model but does not explicitly name the three questions audited; adding this detail would improve clarity without altering the argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below, agreeing on points of dependence and limitation where they are valid, and outlining targeted revisions to increase transparency without altering the core empirical findings.

Referee: [Abstract and audit results section] Abstract and the section describing the audit results: the claim that F2 occurs in every one of the eight proofs (and that the instrument reaches 100% precision) is load-bearing for the paper's central finding, yet rests entirely on the accuracy of the external First Proof Appendix A post-mortems for Questions 1, 2 and 5 plus the human judges' classification of which asserted statements count as load-bearing 'fundamental results' lacking justification. Any variation in those two steps directly changes both the 'every proof' count and the precision/recall figures.

Authors: We agree that the reported counts and precision figures depend on the publicly available post-mortems in First Proof Appendix A and on the human classification of load-bearing claims. The manuscript already cites the benchmark as the source for the taxonomy and ground-truth expectations. To strengthen transparency, we will revise the audit section to include an explicit statement of this dependence, add a detailed description of the classification criteria applied by the judges, and append the complete list of flagged premises with their judge verdicts. These changes will not modify the reported numbers but will enable independent scrutiny.

Revision: yes

Referee: [Instrument introduction and results] The section introducing the premise-audit instrument: the reported 50% proof-level recall means that the instrument fails to flag F2 in half of the affected proofs even in this small corpus; this directly weakens the claim that the instrument 'surfaces' the failure mode effectively and therefore limits the strength of the recommendation for inference-time pipelines.

Authors: We concur that the 50% recall rate, already stated in the manuscript, shows the instrument does not detect every F2 instance even in this corpus and therefore qualifies the strength of any claim that it comprehensively surfaces the failure mode. The primary recommendation for inference-time pipelines rests on the pervasiveness of F2 across all eight proofs and its structural invisibility to citation checks, not on the instrument's recall. We will revise the discussion to explicitly acknowledge the recall limitation and reframe the instrument as a high-precision detector whose value lies in surfacing otherwise undetectable cases rather than as a complete audit solution.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical audit of external LLM outputs against cited benchmark

The paper performs an empirical taxonomy and audit of LLM-generated proofs against the external First Proof benchmark [1] and its Appendix A post-mortems. It introduces failure-mode categories and a premise-audit instrument, then applies them to eight generated proofs with human-judge validation for precision/recall. No derivations, equations, fitted parameters, or predictions exist that reduce to the paper's own inputs by construction. The central F2 finding is an observed count in the audited corpus, not a self-definitional or self-citation-forced result. The citation to [1] points to an independent prior benchmark by different authors, whose post-mortems are treated as external input; the work remains self-contained against that benchmark without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions about the source benchmark and human judgment reliability; no free parameters or new postulated entities with independent evidence are introduced.

axioms (2)

Domain assumption: The per-question post-mortems in the First Proof benchmark's Appendix A accurately capture the failure modes present in the LLM outputs.
The four-mode taxonomy is derived directly from these post-mortems as stated in the abstract.
Domain assumption: Human judges can reliably identify and confirm instances of premise smuggling in the audited proofs.
The reported 100% precision and 50% recall figures depend on these judge confirmations.

Reference graph

Works this paper leans on

9 extracted references · 5 canonical work pages · 2 internal anchors

[1] M. Abouzaid, A. J. Blumberg, M. Hairer, J. Kileel, T. G. Kolda, P. D. Nelson, D. Spielman, N. Srivastava, R. Ward, S. Weinberger, and L. Williams, “First Proof,” arXiv preprint arXiv:2602.05192v2, Mar. 2026.

[2] Epoch AI, “FrontierMath: A benchmark for advanced mathematical reasoning,” 2024. [Online]. Available: https://epoch.ai/frontiermath

[3] J. Schmitt, G. Bérczi, J. Dekoninck, J. Feusi, T. Gehringer, R. Appenzeller, J. Bryan, N. Canova, T. de Wolff, F. Gaia, et al., “IMProofBench: Benchmarking AI on research-level mathematical proof generation,” arXiv preprint arXiv:2509.26076, 2025.

[4] J. Zhang, C. Petrui, K. Nikolić, and F. Tramèr, “RealMath: A continuous benchmark for evaluating language models on research-level mathematics,” arXiv preprint arXiv:2505.12575, 2025.

[5] Lemmanaid Authors, “Lemmanaid: Neuro-symbolic lemma conjecturing with LLM templates,” 2025, preprint.

[6] H. Lightman et al., “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.

[7] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multi-agent debate,” arXiv preprint arXiv:2305.14325, 2023.

[8] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474.

[9] Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
