pith. machine review for the scientific record.

arxiv: 2605.12524 · v1 · submitted 2026-04-07 · 💻 cs.LO · cs.AI

Recognition: unknown

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

Konstantine Arkoudas, Serafim Batzoglou

Pith reviewed 2026-05-14 20:59 UTC · model grok-4.3

classification 💻 cs.LO cs.AI
keywords LLM reasoning evaluation · proof benchmarks · natural deduction · machine-checkable proofs · epistemic stability · combinatorial reasoning · formal verification

The pith

Frontier LLMs perform well on foundational proof tasks but fail at those requiring global combinatorial reasoning or low-level proof synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProofGrid, a benchmark suite of 15 tasks that evaluates LLM reasoning through machine-checkable proofs written in minimal formal notation, chiefly NDL. This setup enables precise, reproducible assessment by avoiding heavy dependence on domain knowledge, external solvers, or extended context. Evaluations across open and proprietary models indicate steady improvement on basic tasks yet persistent shortfalls on structurally demanding ones that require global combinatorial reasoning or low-level proof construction. The paper also identifies epistemic instability, where models produce flawed proofs yet correctly reject the same local steps when those steps are presented in isolation, and supplements accuracy with statistical tools such as 2PL IRT analysis and task-discrimination measures.

Core claim

Using the ProofGrid benchmark and its instrumented proof-checking pipeline, the authors show that current large language models have made rapid progress on foundational reasoning but retain substantial limitations on structurally rich tasks that demand global combinatorial reasoning or low-level proof synthesis.

What carries the argument

ProofGrid benchmark suite expressed in NDL, a compact natural-deduction language, together with an instrumented proof-checking pipeline that locates the first substantive reasoning failure while tolerating minor surface deviations.
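The paper's pipeline is not reproduced here, but the idea it describes, normalizing harmless surface variation and then reporting the first step whose inference is substantively wrong, can be sketched. Everything below (the proof encoding, the rule table, the function names) is an illustrative assumption, not the authors' NDL implementation:

```python
# Hypothetical sketch of an instrumented proof checker in the paper's spirit:
# tolerate surface deviations via normalization, but locate the first step
# whose inference is substantively wrong. Rules and formats are invented here.

def normalize(formula: str) -> str:
    """Canonicalize harmless surface variation (spacing, connective spellings)."""
    return (formula.replace(" ", "")
                   .replace("->", "=>")
                   .replace("&", "∧"))

RULES = {
    # rule name -> function from premise formulas to the licensed conclusion
    "and-intro": lambda a, b: f"({a}∧{b})",
    "mp":        lambda imp, a: imp[1:-1].split("=>", 1)[1]
                 if imp.startswith("(") and imp[1:-1].split("=>", 1)[0] == a
                 else None,
}

def first_substantive_failure(proof):
    """proof: list of (conclusion, rule, premise_indices).
    Returns the index of the first step whose conclusion is not licensed
    by its cited rule and premises, or None if the proof checks out."""
    derived = []
    for i, (concl, rule, prems) in enumerate(proof):
        concl_n = normalize(concl)
        if rule == "premise":
            derived.append(concl_n)
            continue
        args = [derived[j] for j in prems]
        expected = RULES[rule](*args) if rule in RULES else None
        if expected is None or normalize(expected) != concl_n:
            return i                      # first substantive failure located
        derived.append(concl_n)
    return None
```

Surface deviations such as `->` vs `=>` or stray whitespace pass through `normalize` without being counted as failures, while a genuinely unlicensed conclusion is flagged at its index.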

If this is right

  • Frontier models succeed on foundational tasks but struggle with global combinatorial reasoning and low-level proof synthesis.
  • Models display epistemic instability, generating flawed proofs while correctly rejecting similar inferences in isolation.
  • The Epistemic Stability Index quantifies inconsistency between generated proofs and isolated local judgments.
  • 2PL IRT analyses, Wright maps, and normalized Fisher-information measures provide fine-grained assessment of task difficulty and model discrimination.
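The psychometric machinery in the last bullet is standard. Under the 2PL model an item's characteristic curve and Fisher information have closed forms; a minimal sketch follows (the paper's normalized task-discrimination measure is its own construction and is not reproduced here):

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: the probability that a model of
    ability theta solves an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info_2pl(theta, a, b):
    """Item Fisher information under the 2PL model: a^2 * P * (1 - P).
    It peaks at theta = b, so an item is most informative for models
    whose ability sits near its difficulty."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)
```

At theta = b the curve crosses P = 0.5 and the information reaches its maximum a²/4, which is why highly discriminating items separate nearby ability levels sharply.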

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Minimal-formalism tasks may isolate core reasoning more cleanly than benchmarks that permit pattern matching or external tool use.
  • The benchmark could serve as a stable yardstick for measuring incremental gains in LLM reasoning over successive model releases.
  • Hybrid systems that pair LLMs with formal checkers might close the observed gaps in proof synthesis and gap-filling.

Load-bearing premise

That performance on these minimal-formalism proof tasks reliably indicates general reasoning competence without being confounded by domain knowledge, solver delegation, or long-context effects.

What would settle it

Demonstrating that one or more current frontier models can consistently solve the challenge tasks requiring global combinatorial reasoning or low-level proof synthesis would falsify the claim of substantial remaining limits.

Figures

Figures reproduced from arXiv: 2605.12524 by Konstantine Arkoudas, Serafim Batzoglou.

Figure 1. A radar-chart comparison of PROOFGRID and 5 logic-based CNL benchmarks.
Figure 2. Scatterplot of strict accuracy vs instrumentation deltas across the 24 models.
Figure 3. Distribution of first-error types under the instrumented proof checker.
Figure 4. PL1-PC full-task accuracy vs rank.
Figure 5. Lorenz-style curve of model accuracies on the full …
Figure 6. Types of root errors made by models on PL1-PC.
Figure 7. PL1-PC accuracies for the top 10 models as a function of proof length.
Figure 8. PL1-PC accuracies as a function of proof length for the bottom 14 models.
Figure 9. PL1-PM problem statistics.
Figure 10. Conditional (hypothetical) reasoning is very common in this corpus, largely because about half of these proofs are …
Figure 10. Inferential profile of PL1-PM proofs.
Figure 11. PL1-PM accuracy vs proof length for the top 10 models; variable-width bins with n ≥ 25.
Figure 12. PL1-PM accuracy vs proof length for the bottom 14 models; variable-width bins with n ≥ 25.
Figure 13. PL1-PM errors.
Figure 14. PL1-GF problem statistics.
Figure 15. PL1-GF instrumented model accuracies vs total error repairs per 100 lines.
Figure 16. As in other tasks, three main cohorts emerge from this picture.
Figure 17. PL1-GF accuracy vs gapped proof length for the top 12 models.
Figure 18. PL1-GF accuracy vs gapped proof length for the bottom 12 models.
Figure 19. PL2-PW proof-size distributions per model: all proofs (blue) vs correct proofs (green); min–max line, 5–95% band, IQR box, median tick, mean dot.
Figure 20. PL2-PW proof-size median ratios per model (a higher value means the all-proofs median is longer than the correct-proofs-only median).
Figure 21. PL2-PW error corrections made by the instrumented checker per model (sorted by total).
Figure 22. PL2-PW error types flagged by the instrumented checker.
Figure 23. PL2-PW accuracies for the top 12 models as a function of problem size.
Figure 24. The directed acyclic reasoning graph of a proof for a medium-complexity Tseitin-formula problem.
Figure 25. The distribution of reasoning depth across all PL …
Figure 26. Mean values of key numeric features of PL3-PW problems, organized by problem type.
Figure 27. PL3-PW model accuracy vs. problem type heatmap.
Figure 28. PL3-PW distribution of first errors.
Figure 29. Mean values of key numeric features of PL3-PC problems, organized by problem type.
Figure 30. PL3-PC accuracy per problem type.
Figure 31. Error distribution across the 4 error types for …
Figure 32. PL3-PC accuracy, excluding De Bruijn problems, as a function of proof length for the top 12 models.
Figure 33. ESI curves for different values of the curvature parameter …
Figure 34. Problem size distributions for PL4-PW (left) and PL4-PW-c (right).
Figure 35. The distribution of error types in PL4-PW under lenient proof checking, for the first error that caused a proof failure (or the failure to return a proof).
Figure 36. PL4-PW accuracy as a function of problem size.
Figure 37. PL4-PW proof-size distributions per model: all proofs (blue) vs correct proofs (green); min–max line, 5–95% band, IQR box, median tick, mean dot.
Figure 38. Pairwise proof similarity between models on …
Figure 39. Level-2 error distribution for equational proof checking.
Figure 40. Normalized distribution of step-level ER errors.
Figure 41. Completion length distribution by model (boxplots, log scale).
Figure 42. Error typology for the gap-filling task.
Figure 43. Prevalence of errors of different types in the gap-filling task.
Figure 44. Gap-filling accuracy for the top 10 models as a function of gap size.
Figure 45. Gap-filling accuracy for the top 10 models as a function of equational theory size.
Figure 46. Consolidated accuracy heat map for gap filling, as a function of gap length and equational theory size.
Figure 47. Gap-filling calibration errors across all models (bin-level and overall ECE).
Figure 48. Definition of the NDL interpreter function.
Figure 49. Three ICCs for three sample PL2-PW problems.
Figure 50. Fisher information of a task item as a function of different …
Figure 51. The task-information curve of proof masking over the entire observed ability range.
Figure 52. Ability parameters estimated from PL4-PW results, alongside the task-information curve of Hilbert proof-writing over the observed ability range.
Figure 53. API call timeline.
read the original abstract

We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in terms of representation, verification guarantees, and reasoning depth. Methodologically, we introduce an instrumented proof-checking pipeline that tolerates minor surface deviations while locating the first substantive reasoning failure, improving measurement resolution and separating proof planning from low-level execution noise. Using this pipeline, we evaluate a broad range of open and proprietary models. Results show rapid progress but substantial remaining limits: frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved. We also identify epistemic instability, where models generate flawed proofs yet correctly reject those local inferences in isolation, and formalize this with an Epistemic Stability Index. Finally, we complement accuracy with 2PL IRT analyses, Wright maps, and a normalized task-discrimination measure based on Fisher information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProofGrid, a benchmark of 15 tasks for evaluating LLM reasoning via machine-checkable proofs expressed in minimal natural-deduction notation (NDL). Tasks span proof writing, checking, masking, and gap-filling; an instrumented checker locates the first substantive failure. Evaluations across open and proprietary models show frontier systems succeeding on foundational tasks while failing on those requiring global combinatorial reasoning or low-level synthesis. The authors introduce an Epistemic Stability Index to quantify inconsistency between generated proofs and isolated inferences, and supplement accuracy metrics with 2PL IRT, Wright maps, and Fisher-information-based discrimination measures. A comparative framework situates ProofGrid against prior benchmarks on representation, verification, and reasoning depth.

Significance. If the minimal-formalism design and instrumented pipeline genuinely isolate reasoning competence, the work supplies a reproducible, auditable evaluation instrument that advances formal-methods approaches to LLM assessment and supplies concrete, falsifiable evidence of remaining limits on combinatorial and synthesis tasks.

major comments (2)
  1. [Section 4] Section 4 and the comparative framework: the central claim that performance gaps reflect reasoning competence rather than pretraining artifacts, prompt-length effects, or residual domain cues rests on the assertion that minimal NDL minimizes those confounds, yet no ablations (NDL vs. equivalent natural-language prompts, fixed vs. variable context length, or solver-delegation controls) are reported. Without these quantitative checks the interpretation of the reported limits on global-combinatorial and low-level-synthesis tasks remains under-supported.
  2. [Epistemic Stability Index] Epistemic Stability Index definition and measurement: the index is introduced to capture the observed inconsistency between flawed proof generation and correct rejection of the same local inferences in isolation, but the manuscript does not supply the precise formula, the aggregation method across tasks, or statistical validation that the index is independent of the instrumented checker’s tolerance parameters.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'rapid progress' is used without accompanying quantitative deltas, model identifiers, or task-level accuracy figures, leaving the claim imprecise on first reading.
  2. [Task definitions] The manuscript would benefit from an explicit table listing the 15 tasks, their NDL encodings, and the precise verification criteria applied by the instrumented checker.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional evidence and clarity would strengthen the manuscript's claims about isolating reasoning competence. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Section 4] Section 4 and the comparative framework: the central claim that performance gaps reflect reasoning competence rather than pretraining artifacts, prompt-length effects, or residual domain cues rests on the assertion that minimal NDL minimizes those confounds, yet no ablations (NDL vs. equivalent natural-language prompts, fixed vs. variable context length, or solver-delegation controls) are reported. Without these quantitative checks the interpretation of the reported limits on global-combinatorial and low-level-synthesis tasks remains under-supported.

    Authors: We agree that the interpretation of the performance gaps would be more robust with explicit ablations. The manuscript's design rationale emphasizes that minimal NDL reduces prompt length and domain cues compared to natural-language formulations, and the instrumented checker separates planning from execution noise. However, we acknowledge that without direct quantitative comparisons these claims rest on design arguments rather than empirical controls. In the revision we will add a new subsection (likely 4.3) reporting ablations on a representative subset of tasks: (i) NDL vs. equivalent natural-language prompts, (ii) fixed vs. variable context lengths, and (iii) a solver-delegation control where models are allowed to call external provers. These results will be used to quantify the contribution of each confound and to support the claim that the observed limits on combinatorial and synthesis tasks primarily reflect reasoning competence. revision: yes

  2. Referee: [Epistemic Stability Index] Epistemic Stability Index definition and measurement: the index is introduced to capture the observed inconsistency between flawed proof generation and correct rejection of the same local inferences in isolation, but the manuscript does not supply the precise formula, the aggregation method across tasks, or statistical validation that the index is independent of the instrumented checker’s tolerance parameters.

    Authors: We appreciate the referee highlighting the missing formal details. The Epistemic Stability Index is intended to quantify the discrepancy between a model's ability to reject incorrect local inferences when presented in isolation versus its tendency to produce globally flawed proofs. In the current draft the definition is described only at a high level. In the revised manuscript we will insert a precise definition in Section 3: ESI = (1/T) * sum_t (C_t / E_t), where C_t is the number of correctly rejected isolated inferences for task t, E_t is the number of substantive errors in the generated proof for task t, and T is the number of tasks; aggregation is performed as a difficulty-weighted average using the 2PL IRT discrimination parameters already reported. We will also add a sensitivity analysis demonstrating that ESI values remain stable across the range of checker tolerance parameters used in the evaluation (0.05–0.15). These additions will be accompanied by the exact pseudocode and a small table of validation results. revision: yes
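The formula proposed in this simulated rebuttal is straightforward to operationalize. The sketch below assumes the rebuttal's own definitions (C_t correctly rejected isolated inferences, E_t substantive errors in generated proofs, optional discrimination weights); it illustrates the proposed formula, not the paper's actual ESI computation:

```python
def esi(counts, weights=None):
    """Epistemic Stability Index as proposed in the rebuttal (illustrative).

    counts:  list of (C_t, E_t) pairs per task, where C_t is the number of
             correctly rejected isolated inferences and E_t > 0 the number
             of substantive errors in the generated proofs for task t.
    weights: optional per-task weights (e.g. 2PL discrimination parameters);
             defaults to the uniform weighting of the plain formula.
    """
    if weights is None:
        weights = [1.0] * len(counts)
    total_w = sum(weights)
    # Weighted mean of per-task ratios C_t / E_t.
    return sum(w * (c / e) for (c, e), w in zip(counts, weights)) / total_w
```

With uniform weights this reduces to (1/T) * sum_t (C_t / E_t); supplying discrimination parameters gives the difficulty-weighted variant the rebuttal describes.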

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with independent results

full rationale

The paper introduces ProofGrid as a benchmark suite and reports direct experimental outcomes from evaluating LLMs on machine-checkable proof tasks in minimal NDL notation. No derivation chain exists that reduces predictions or results to fitted inputs, self-definitions, or self-citation load-bearing premises. The Epistemic Stability Index is newly formalized from observed model behavior, 2PL IRT analyses apply standard psychometrics to the collected data, and the comparative framework situates the benchmark without invoking prior author results as uniqueness theorems. All central claims rest on reproducible verification pipelines and model outputs rather than any construction that equates outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that minimal formal proofs isolate reasoning competence; the paper introduces two new entities (ProofGrid and Epistemic Stability Index) without independent external validation.

axioms (1)
  • domain assumption Minimal formal notation (NDL) suffices to evaluate reasoning competence while minimizing domain knowledge and long-context effects
    Invoked in the design of the 15 tasks and the claim that the benchmark reduces reliance on domain knowledge and solver delegation.
invented entities (2)
  • ProofGrid no independent evidence
    purpose: Benchmark suite spanning proof writing, checking, masking, and gap-filling with machine verification
    Newly introduced in this work; no prior reference in the abstract.
  • Epistemic Stability Index no independent evidence
    purpose: Quantify inconsistency between generated proofs and isolated inference judgments
    Formalized in this paper to capture observed epistemic instability.

pith-pipeline@v0.9.0 · 5595 in / 1375 out tokens · 65540 ms · 2026-05-14T20:59:21.683069+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

143 extracted references · 27 canonical work pages · 6 internal anchors

  1. [1]

    Karl Popper: Critical Assessments of Leading Philosophers , publisher =

    Alan Musgrave , title =. Karl Popper: Critical Assessments of Leading Philosophers , publisher =. 2003 , pages =

  2. [2]

    Journal for General Philosophy of Science , year =
