Recognition: no theorem link
ABD: Default Exception Abduction in Finite First Order Worlds
Pith reviewed 2026-05-15 20:17 UTC · model grok-4.3
The pith
ABD benchmark evaluates LLMs on producing parsimonious first-order exception formulas in three observation regimes using SMT verification, finding high validity but persistent parsimony and generalization gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
Load-bearing premise
The three formalized observation regimes (closed-world, existential completion, universal completion) together capture the essential difficulties of default-exception abduction in real finite first-order settings.
read the original abstract
We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ABD benchmark for default-exception abduction over finite first-order worlds. Given a background theory containing an abnormality predicate and relational structures, the task requires outputting a first-order formula defining exceptions that restores satisfiability while minimizing the number of exceptions. Three observation regimes (closed-world, existential completion, universal completion) are formalized and verified exactly via SMT. Ten frontier LLMs are evaluated on 600 instances; the strongest models achieve high validity but exhibit persistent parsimony gaps, with holdout tests revealing regime-specific generalization failure modes.
Significance. If the benchmark construction is shown to be representative, the work supplies a rigorously verifiable testbed for LLM abductive reasoning that goes beyond informal prompting. The use of exact SMT verification for both validity and parsimony is a clear methodological strength, eliminating reliance on approximate or human-judged metrics. The reported regime-dependent failure modes, if robust, could usefully inform targeted improvements in logical generalization.
major comments (2)
- [Formalization of observation regimes] The central claim that the three observation regimes jointly expose the essential difficulties of default-exception abduction (and thereby produce meaningful generalization failure modes) lacks any coverage argument or necessity argument. No analysis is given showing that other natural finite-FO patterns (e.g., partial completions with cardinality constraints or noisy observations) would not induce qualitatively different LLM behavior. This justification is load-bearing for the holdout-evaluation conclusions.
- [Experimental evaluation] The experimental section reports aggregate results over 600 instances and claims statistically meaningful gaps in parsimony and distinct holdout failure modes, yet supplies no description of the instance-generation procedure, the distribution of formula sizes or predicate arities, or any statistical significance tests. Without these details the robustness of the model comparisons cannot be assessed.
minor comments (2)
- [Preliminaries] The notation for the abnormality predicate and the precise definition of parsimony (number of exceptions versus size of the defining formula) should be stated once in a single location rather than distributed across the formalization and evaluation sections.
- [Results] Table captions and axis labels on the holdout-performance plots would benefit from explicit mention of the three regimes to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications and committing to targeted revisions that strengthen the justification and experimental details without overstating our claims.
read point-by-point responses
-
Referee: The central claim that the three observation regimes jointly expose the essential difficulties of default-exception abduction (and thereby produce meaningful generalization failure modes) lacks any coverage argument or necessity argument. No analysis is given showing that other natural finite-FO patterns (e.g., partial completions with cardinality constraints or noisy observations) would not induce qualitatively different LLM behavior. This justification is load-bearing for the holdout-evaluation conclusions.
Authors: We selected the closed-world, existential-completion, and universal-completion regimes because they instantiate standard, well-studied assumptions from the default-reasoning literature (closed-world assumption and completion semantics) and systematically vary the strength of observational constraints on the abnormality predicate. Our experiments demonstrate that these variations already produce distinct, regime-specific generalization failures. We do not claim the three regimes are exhaustive or formally necessary; a full coverage argument would require enumerating all possible finite-FO observation patterns, which lies outside the paper's scope. In revision we will add a short subsection (new Section 3.4) that (i) motivates the choice by reference to the literature, (ii) explicitly states the scope limitation, and (iii) notes that extensions to noisy or cardinality-constrained observations remain open. This clarifies rather than removes the load-bearing role of the regimes. revision: partial
-
Referee: The experimental section reports aggregate results over 600 instances and claims statistically meaningful gaps in parsimony and distinct holdout failure modes, yet supplies no description of the instance-generation procedure, the distribution of formula sizes or predicate arities, or any statistical significance tests. Without these details the robustness of the model comparisons cannot be assessed.
Authors: We agree that these details are required for reproducibility and robustness assessment. Instances were generated by uniformly sampling finite domains (3–8 constants), predicates (arity 1–3), and background theories (5–15 clauses) while enforcing satisfiability under each regime; the 600 instances were partitioned evenly across regimes and difficulty strata. In the revised manuscript we will (i) provide pseudocode for the generation procedure, (ii) include tables and histograms reporting the empirical distributions of formula size, predicate arity, domain size, and clause count, and (iii) add statistical significance results (paired Wilcoxon tests for parsimony gaps and ANOVA with post-hoc Tukey HSD for multi-model comparisons, all with p < 0.01 thresholds). These additions will be placed in an expanded Section 5 and the appendix. revision: yes
Circularity Check
No circularity: benchmark definition and LLM evaluation are independent
full rationale
The paper defines ABD as a benchmark with three observation regimes formalized via SMT, generates 600 instances, and reports external LLM performance metrics (validity, parsimony, holdout generalization). No derivation step equates a claimed result to its own inputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing claim rests on self-citation chains or smuggled ansatzes. The regimes are presented as chosen formalizations rather than derived from the LLM outcomes, making the evaluation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Finite first-order structures with abnormality predicates admit exact satisfiability checking via SMT solvers
invented entities (1)
-
ABD benchmark
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Stress-testing the reasoning competence of language models with formal proofs. In Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; and Peng, V ., eds.,Findings of the Association for Computational Linguistics: EMNLP 2025, 12376–12394. Suzhou, China: Association for Computational Linguistics. Azerbayev, Z.; Piotrowski, B.; Schoelkopf, H.; Ayers, E. W.; R...
work page 2025
-
[2]
Bansal, A.; Loos, S.; Rabe, M.; Szegedy, C.; and Wilcox, S
ProofNet: Autoformalizing and formally proving undergraduate-level mathematics.arXiv preprint arXiv:2302.12433. Bansal, A.; Loos, S.; Rabe, M.; Szegedy, C.; and Wilcox, S
-
[3]
HOList: An environment for machine learning of higher order logic theorem proving. InProceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, 454–463. Baral, C. 2003.Knowledge Representation, Reasoning and Declarative Problem Solving. Cambridge, UK: Cambridge University Press. Bhag...
work page 2003
-
[4]
Training Verifiers to Solve Math Word Problems
Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168. Cropper, A., and Muggleton, S
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2811–2822
Bridging machine learning and logical reasoning by abductive learning. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2811–2822. d’Avila Garcez, A., and Lamb, L. C
work page 2019
-
[6]
InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23752–23759
Inductive learning of logical theories with LLMs: An expressivity-graded analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, 23752–23759. Gebser, M.; Kaminski, R.; Kaufmann, B.; and Schaub, T. 2012.Answer Set Solving in Practice. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publish...
work page 2012
-
[7]
The stable model semantics for logic programming. InProceedings of the 5th International Conference and Symposium on Logic Programming (ICLP), 1070–1080. Hamscher, W.; Console, L.; and de Kleer, J. 1992.Readings in Model-Based Diagnosis. Morgan Kaufmann. Han, S.; Schoelkopf, H.; Zhao, Y .; Qi, Z.; Riddell, M.; Zhou, W.; Coady, J.; Peng, D.; Qiao, Y .; Ben...
work page 1992
-
[8]
FOLIO: Natural language reasoning with first-order logic. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 22017–22031. Association for Computational Linguistics. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; and Steinhardt, J
work page 2024
-
[9]
Measuring Mathematical Problem Solving With the MATH Dataset
Measuring mathematical problem solving with the MATH dataset.arXiv preprint arXiv:2103.03874. Hobbs, J. R.; Stickel, M. E.; Appelt, D. E.; and Martin, P
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Inductive learning of answer set programs. InLogics in Artificial Intelligence: 14th European Conference, JELIA 2014, Funchal, Madeira, Portugal, September 24-26,
work page 2014
-
[11]
A corpus and cloze evaluation for deeper understanding of commonsense stories. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 839–849. San Diego, California: Association for Computational Linguistics. Muggleton, S., and Raedt, L. D
work page 2016
- [12]
- [13]
-
[14]
Choice of plausible alternatives: An evaluation of commonsense causal reasoning. InLogical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21–23,
work page 2011
-
[15]
Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 404–415. Solar-Lezama, A. 2008.Program Synthesis by Sketching. Ph.D. Dissertation, University of California, Berkeley. Srinivasan, A
work page 2008
-
[16]
Tafjord, Ø.; Dalvi, B.; and Clark, P
Language models do not follow occam’s razor: A benchmark for inductive and abductive reasoning.CoRRabs/2509.03345. Tafjord, Ø.; Dalvi, B.; and Clark, P
-
[17]
InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 3621–3634
ProofWriter: Generating implications, proofs, and abductive statements over natural language. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 3621–3634. Torlak, E., and Bodík, R
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.