arxiv: 2602.18843 · v3 · submitted 2026-02-21 · 💻 cs.AI · cs.SC

Recognition: no theorem link

ABD: Default Exception Abduction in Finite First Order Worlds

Serafim Batzoglou

Authors on Pith no claims yet

Pith reviewed 2026-05-15 20:17 UTC · model grok-4.3

classification 💻 cs.AI cs.SC

keywords abductioncompletionexceptionsfinitefirst-orderregimesworldsabnormality

0 comments

The pith

ABD benchmark evaluates LLMs on producing parsimonious first-order exception formulas in three observation regimes using SMT verification, finding high validity but persistent parsimony and generalization gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a test called ABD for a specific kind of logical reasoning called default-exception abduction. A background theory contains an abnormality predicate that marks exceptions. Given some observed facts in relational structures, the task is to write a first-order logic formula that says exactly which things are abnormal. This formula must make the entire theory consistent again while using as few exceptions as possible. Three different observation settings are defined: closed-world assumes unmentioned facts are false, existential completion adds some possible existences, and universal completion adds universal statements. An SMT solver checks exactly whether each proposed formula works. Ten leading large language models were run on 600 generated instances. The strongest models produced valid formulas most of the time, yet they tended to introduce more exceptions than needed. On held-out test cases the models failed to generalize in different ways depending on which observation regime was used.

Core claim

Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.

Load-bearing premise

The three formalized observation regimes (closed-world, existential completion, universal completion) together capture the essential difficulties of default-exception abduction in real finite first-order settings.

read the original abstract

We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ABD introduces a verifiable benchmark for default-exception abduction but needs better justification for its observation regimes.

read the letter

The paper's main contribution is a new benchmark called ABD for default-exception abduction over finite first-order worlds. It defines three observation regimes—closed-world, existential completion, and universal completion—and uses SMT solvers to exactly verify whether a proposed exception formula restores satisfiability while keeping exceptions sparse. The evaluation runs ten frontier LLMs on 600 instances and reports high validity scores alongside persistent parsimony gaps, plus distinct holdout failure modes across regimes.

Referee Report

2 major / 2 minor

Summary. The paper introduces the ABD benchmark for default-exception abduction over finite first-order worlds. Given a background theory containing an abnormality predicate and relational structures, the task requires outputting a first-order formula defining exceptions that restores satisfiability while minimizing the number of exceptions. Three observation regimes (closed-world, existential completion, universal completion) are formalized and verified exactly via SMT. Ten frontier LLMs are evaluated on 600 instances; the strongest models achieve high validity but exhibit persistent parsimony gaps, with holdout tests revealing regime-specific generalization failure modes.

Significance. If the benchmark construction is shown to be representative, the work supplies a rigorously verifiable testbed for LLM abductive reasoning that goes beyond informal prompting. The use of exact SMT verification for both validity and parsimony is a clear methodological strength, eliminating reliance on approximate or human-judged metrics. The reported regime-dependent failure modes, if robust, could usefully inform targeted improvements in logical generalization.

major comments (2)

[Formalization of observation regimes] The central claim that the three observation regimes jointly expose the essential difficulties of default-exception abduction (and thereby produce meaningful generalization failure modes) lacks any coverage argument or necessity argument. No analysis is given showing that other natural finite-FO patterns (e.g., partial completions with cardinality constraints or noisy observations) would not induce qualitatively different LLM behavior. This justification is load-bearing for the holdout-evaluation conclusions.
[Experimental evaluation] The experimental section reports aggregate results over 600 instances and claims statistically meaningful gaps in parsimony and distinct holdout failure modes, yet supplies no description of the instance-generation procedure, the distribution of formula sizes or predicate arities, or any statistical significance tests. Without these details the robustness of the model comparisons cannot be assessed.

minor comments (2)

[Preliminaries] The notation for the abnormality predicate and the precise definition of parsimony (number of exceptions versus size of the defining formula) should be stated once in a single location rather than distributed across the formalization and evaluation sections.
[Results] Table captions and axis labels on the holdout-performance plots would benefit from explicit mention of the three regimes to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications and committing to targeted revisions that strengthen the justification and experimental details without overstating our claims.

read point-by-point responses

Referee: The central claim that the three observation regimes jointly expose the essential difficulties of default-exception abduction (and thereby produce meaningful generalization failure modes) lacks any coverage argument or necessity argument. No analysis is given showing that other natural finite-FO patterns (e.g., partial completions with cardinality constraints or noisy observations) would not induce qualitatively different LLM behavior. This justification is load-bearing for the holdout-evaluation conclusions.

Authors: We selected the closed-world, existential-completion, and universal-completion regimes because they instantiate standard, well-studied assumptions from the default-reasoning literature (closed-world assumption and completion semantics) and systematically vary the strength of observational constraints on the abnormality predicate. Our experiments demonstrate that these variations already produce distinct, regime-specific generalization failures. We do not claim the three regimes are exhaustive or formally necessary; a full coverage argument would require enumerating all possible finite-FO observation patterns, which lies outside the paper's scope. In revision we will add a short subsection (new Section 3.4) that (i) motivates the choice by reference to the literature, (ii) explicitly states the scope limitation, and (iii) notes that extensions to noisy or cardinality-constrained observations remain open. This clarifies rather than removes the load-bearing role of the regimes. revision: partial
Referee: The experimental section reports aggregate results over 600 instances and claims statistically meaningful gaps in parsimony and distinct holdout failure modes, yet supplies no description of the instance-generation procedure, the distribution of formula sizes or predicate arities, or any statistical significance tests. Without these details the robustness of the model comparisons cannot be assessed.

Authors: We agree that these details are required for reproducibility and robustness assessment. Instances were generated by uniformly sampling finite domains (3–8 constants), predicates (arity 1–3), and background theories (5–15 clauses) while enforcing satisfiability under each regime; the 600 instances were partitioned evenly across regimes and difficulty strata. In the revised manuscript we will (i) provide pseudocode for the generation procedure, (ii) include tables and histograms reporting the empirical distributions of formula size, predicate arity, domain size, and clause count, and (iii) add statistical significance results (paired Wilcoxon tests for parsimony gaps and ANOVA with post-hoc Tukey HSD for multi-model comparisons, all with p < 0.01 thresholds). These additions will be placed in an expanded Section 5 and the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark definition and LLM evaluation are independent

full rationale

The paper defines ABD as a benchmark with three observation regimes formalized via SMT, generates 600 instances, and reports external LLM performance metrics (validity, parsimony, holdout generalization). No derivation step equates a claimed result to its own inputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing claim rests on self-citation chains or smuggled ansatzes. The regimes are presented as chosen formalizations rather than derived from the LLM outcomes, making the evaluation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central contribution is the definition of the ABD task and three regimes; these rest on standard assumptions about finite structures and SMT decidability rather than new free parameters or invented entities.

axioms (1)

domain assumption Finite first-order structures with abnormality predicates admit exact satisfiability checking via SMT solvers
Invoked to enable exact verification of proposed exception formulas

invented entities (1)

ABD benchmark no independent evidence
purpose: Standardized test for default-exception abduction
Newly defined task and evaluation protocol

pith-pipeline@v0.9.0 · 5373 in / 1182 out tokens · 40130 ms · 2026-05-15T20:17:29.955463+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

[1]

In Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; and Peng, V ., eds.,Findings of the Association for Computational Linguistics: EMNLP 2025, 12376–12394

Stress-testing the reasoning competence of language models with formal proofs. In Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; and Peng, V ., eds.,Findings of the Association for Computational Linguistics: EMNLP 2025, 12376–12394. Suzhou, China: Association for Computational Linguistics. Azerbayev, Z.; Piotrowski, B.; Schoelkopf, H.; Ayers, E. W.; R...

work page 2025
[2]

Bansal, A.; Loos, S.; Rabe, M.; Szegedy, C.; and Wilcox, S

ProofNet: Autoformalizing and formally proving undergraduate-level mathematics.arXiv preprint arXiv:2302.12433. Bansal, A.; Loos, S.; Rabe, M.; Szegedy, C.; and Wilcox, S

work page arXiv
[3]

InProceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, 454–463

HOList: An environment for machine learning of higher order logic theorem proving. InProceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, 454–463. Baral, C. 2003.Knowledge Representation, Reasoning and Declarative Problem Solving. Cambridge, UK: Cambridge University Press. Bhag...

work page 2003
[4]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168. Cropper, A., and Muggleton, S

work page internal anchor Pith review Pith/arXiv arXiv
[5]

In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2811–2822

Bridging machine learning and logical reasoning by abductive learning. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2811–2822. d’Avila Garcez, A., and Lamb, L. C

work page 2019
[6]

InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23752–23759

Inductive learning of logical theories with LLMs: An expressivity-graded analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23752–23759. Gebser, M.; Kaminski, R.; Kaufmann, B.; and Schaub, T. 2012.Answer Set Solving in Practice. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publish...

work page 2012
[7]

InProceedings of the 5th International Conference and Symposium on Logic Programming (ICLP), 1070–1080

The stable model semantics for logic programming. InProceedings of the 5th International Conference and Symposium on Logic Programming (ICLP), 1070–1080. Hamscher, W.; Console, L.; and de Kleer, J. 1992.Readings in Model-Based Diagnosis. Morgan Kaufmann. Han, S.; Schoelkopf, H.; Zhao, Y .; Qi, Z.; Riddell, M.; Zhou, W.; Coady, J.; Peng, D.; Qiao, Y .; Ben...

work page 1992
[8]

InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 22017–22031

FOLIO: Natural language reasoning with first-order logic. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 22017–22031. Association for Computational Linguistics. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; and Steinhardt, J

work page 2024
[9]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the MATH dataset.arXiv preprint arXiv:2103.03874. Hobbs, J. R.; Stickel, M. E.; Appelt, D. E.; and Martin, P

work page internal anchor Pith review Pith/arXiv arXiv
[10]

InLogics in Artificial Intelligence: 14th European Conference, JELIA 2014, Funchal, Madeira, Portugal, September 24-26,

Inductive learning of answer set programs. InLogics in Artificial Intelligence: 14th European Conference, JELIA 2014, Funchal, Madeira, Portugal, September 24-26,

work page 2014
[11]

InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 839–849

A corpus and cloze evaluation for deeper understanding of commonsense stories. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 839–849. San Diego, California: Association for Computational Linguistics. Muggleton, S., and Raedt, L. D

work page 2016
[12]

Poole, D

Generative language modeling for automated theorem proving.arXiv preprint arXiv:2009.03393. Poole, D

work page arXiv 2009
[13]

Raedt, L

Learning logical definitions from relations.Machine Learning5(3):239–266. Raedt, L. D. 2008.Logical and Relational Learning. Springer. Reiter, R

work page 2008
[14]

InLogical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21–23,

Choice of plausible alternatives: An evaluation of commonsense causal reasoning. InLogical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21–23,

work page 2011
[15]

In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 404–415

Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 404–415. Solar-Lezama, A. 2008.Program Synthesis by Sketching. Ph.D. Dissertation, University of California, Berkeley. Srinivasan, A

work page 2008
[16]

Tafjord, Ø.; Dalvi, B.; and Clark, P

Language models do not follow occam’s razor: A benchmark for inductive and abductive reasoning.CoRRabs/2509.03345. Tafjord, Ø.; Dalvi, B.; and Clark, P

work page arXiv
[17]

InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 3621–3634

ProofWriter: Generating implications, proofs, and abductive statements over natural language. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 3621–3634. Torlak, E., and Bodík, R

work page 2021