pith. machine review for the scientific record.

arxiv: 2604.16672 · v1 · submitted 2026-04-17 · 💻 cs.AI

Recognition: unknown

From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies

David Flynn, Duncan Guthrie, Haoruo Zhao, Michele Sevegnani, Paul Harvey, Wenshuo Tang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords errors · llms · subsumption · active · corresponding · counter-concept · language · learning

The pith

LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Active learning builds ontologies by asking yes-no questions about whether one concept is a subclass of another. The authors turn each such question into a satisfiability problem by constructing a counter-concept that would have real-world examples only if the answer were no. They then ask an LLM to name real-world things that would be examples of that counter-concept. If the LLM is wrong, the worst outcome is missing some needed axioms, which only slows construction; it cannot add axioms that contradict the rest of the ontology. Tests across 13 commercial LLMs found recall stayed stable on several standard ontologies.
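
A minimal sketch of that loop, under the simplest reading of the design; `ask_llm_for_instance` is a hypothetical stand-in for the verbalised query the paper sends to each model, not the authors' implementation:

```python
from typing import Callable, Optional

def counter_concept(sub: str, sup: str) -> str:
    # The candidate axiom "sub ⊑ sup" holds exactly when "sub ⊓ ¬sup" is unsatisfiable.
    return f"{sub} ⊓ ¬{sup}"

def decide_axiom(sub: str, sup: str,
                 ask_llm_for_instance: Callable[[str], Optional[str]]) -> bool:
    """Return True to add the candidate axiom sub ⊑ sup, False to withhold it."""
    example = ask_llm_for_instance(counter_concept(sub, sup))
    if example is not None:
        # The model named something it claims is a `sub` that is not a `sup`,
        # so the subsumption is rejected and no axiom is added. If the example
        # was hallucinated, the cost is a needed axiom arriving late (the paper's
        # Type II delay), never a contradictory axiom entering the ontology.
        return False
    # No example offered: treat the counter-concept as unsatisfiable and add the axiom.
    return True

# Toy usage with a hard-coded stand-in "model" that knows one counterexample:
toy_llm = lambda query: "a penguin" if "Bird ⊓ ¬FlyingThing" in query else None
assert decide_axiom("Apple", "Fruit", toy_llm) is True        # no counterexample found
assert decide_axiom("Bird", "FlyingThing", toy_llm) is False  # counterexample named
```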

Core claim

This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall remains stable across several well-established ontologies.

Load-bearing premise

That verbalizing counter-concepts in controlled natural language preserves enough logical meaning for LLMs to generate useful and unbiased example instances, and that the reduction from subsumption to satisfiability remains sound when the oracle is an LLM rather than a perfect reasoner.
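
A sketch of what such a verbalization could look like; the template wording below is invented for illustration and is not the paper's controlled natural language:

```python
def verbalize_counter_concept(sub: str, sup: str) -> str:
    # Turn the counter-concept sub ⊓ ¬sup of the candidate axiom sub ⊑ sup
    # into a natural-language request for a witness. Illustrative template only.
    return (
        f"Give one real-world example of something that is an instance of "
        f"'{sub}' but is not an instance of '{sup}'. "
        f"If no such example can exist, answer 'none'."
    )

# For the abstract's membership query "Is every apple a fruit?" (Apple ⊑ Fruit):
print(verbalize_counter_concept("Apple", "Fruit"))
```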

Figures

Figures reproduced from arXiv: 2604.16672 by David Flynn, Duncan Guthrie, Haoruo Zhao, Michele Sevegnani, Paul Harvey, Wenshuo Tang.

Figure 1: An LLM-assisted Active Learning Setting
Figure 2: LLM Fuzzing Method
Figure 3: Confusion Matrix for Active Learning
Figure 4: LLM-Generated Real-World Example for the Counter-Concept
Original abstract

In active learning, membership queries (MQs) allow a learner to pose questions to a teacher, such as ''Is every apple a fruit?'', to which the teacher responds correctly with yes or no. These MQs can be viewed as subsumption tests with respect to the target ontology. Inspired by the standard reduction of subsumption to satisfiability in description logics, we reformulate each candidate axiom into its corresponding counter-concept and verbalise it in controlled natural language before presenting it to Large Language Models (LLMs). We introduce LLMs as a third component that provides real-world examples approximating an instance of the counter-concept. This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall, corresponding to Type II errors in our framework, remains stable across several well-established ontologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes reformulating membership queries in active learning for OWL ontologies as subsumption tests, reduced to satisfiability via counter-concepts (C ⊓ ¬D). These are verbalized in controlled natural language and presented to LLMs, which act as oracles by attempting to generate real-world instances. The central claim is that this design ensures only Type II errors (delaying axiom addition) can occur, with no risk of introducing inconsistencies; experiments across 13 commercial LLMs report stable recall on several established ontologies.

Significance. If the Type-II-only error guarantee were rigorously established, the approach would offer a low-risk method for LLM-assisted ontology construction, addressing a key barrier in scaling knowledge engineering. The use of the standard DL reduction combined with LLMs as an external oracle is a clear strength, and the reported stability of recall across multiple models and ontologies provides initial evidence of practical utility. However, the significance is limited by the need to substantiate the safety property against LLM incompleteness.
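
To make the disputed error profile concrete, the four possible oracle outcomes under the reduction can be tabulated as follows; the labels follow the referee's reading and the paper's Type I/Type II terminology, and the function is an illustration rather than the paper's own analysis:

```python
def outcome(counter_concept_satisfiable: bool, llm_offered_instance: bool) -> str:
    """Classify one oracle decision under the subsumption-to-satisfiability reduction.

    counter_concept_satisfiable: ground truth, whether C ⊓ ¬D has an instance
        (i.e. the candidate axiom C ⊑ D is actually false).
    llm_offered_instance: whether the model offered something it claims
        instantiates the counter-concept (possibly a hallucination).
    """
    if counter_concept_satisfiable and llm_offered_instance:
        return "correct rejection: a false axiom is kept out"
    if counter_concept_satisfiable and not llm_offered_instance:
        # Incomplete coverage: a false axiom is wrongly added.
        return "Type I error: the failure mode the referee says the abstract cannot rule out"
    if not counter_concept_satisfiable and llm_offered_instance:
        # Hallucinated witness: a true axiom is withheld for now.
        return "Type II error: construction is delayed, no inconsistency is introduced"
    return "correct acceptance: a true axiom is added"
```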

major comments (2)
  1. [Abstract] Abstract: The assertion that the design 'ensures that only Type II errors may occur... without introducing inconsistencies' does not follow from the reduction. Absence of an LLM-generated instance for the counter-concept leads to accepting the subsumption C ⊑ D; because LLMs have incomplete coverage and verbalization may obscure edge cases, a satisfiable counter-concept can be misclassified as unsatisfiable, wrongly adding the axiom (a Type I error). This directly contradicts the safety claim and is load-bearing for the paper's main contribution.
  2. [Experimental results] Experimental results section (referenced in abstract): The claim of 'stable recall' across 13 LLMs lacks any description of prompt engineering, instance generation/verification procedure, statistical tests for stability, or controls for verbalization fidelity. Without these, it is impossible to assess whether the reported recall truly measures only Type II errors or whether the oracle implementation introduces the very false-positive subsumptions the framework claims to avoid.
minor comments (2)
  1. [Abstract] The abstract is overly dense; expanding the description of how counter-concepts are verbalized and how LLM responses are interpreted as satisfiability verdicts would improve readability.
  2. No discussion of related work on LLM reliability for logical reasoning or prior active learning frameworks in description logics is visible in the provided abstract; adding targeted citations would strengthen context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our core safety claim and experimental methodology. We address each major comment below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that the design 'ensures that only Type II errors may occur... without introducing inconsistencies' does not follow from the reduction. Absence of an LLM-generated instance for the counter-concept leads to accepting the subsumption C ⊑ D; because LLMs have incomplete coverage and verbalization may obscure edge cases, a satisfiable counter-concept can be misclassified as unsatisfiable, wrongly adding the axiom (a Type I error). This directly contradicts the safety claim and is load-bearing for the paper's main contribution.

    Authors: We agree that the original abstract wording overstates the guarantee. The framework queries LLMs for real-world instances of the counter-concept C ⊓ ¬D; generation of a valid instance rejects the subsumption (preventing addition of a false axiom), while absence leads to acceptance. Hallucinated instances therefore cause only Type II errors (delayed axioms), but, as the referee notes, LLM incompleteness can cause missed counterexamples and thus wrongly added axioms, i.e. Type I errors. We will revise the abstract to qualify the claim explicitly, stating that the approach reduces inconsistency risk relative to direct membership queries and that the Type-II-only property holds only when the LLM never overlooks an existing counterexample, and we will add a short discussion of this completeness assumption and its implications for the safety property. revision: yes

  2. Referee: [Experimental results] Experimental results section (referenced in abstract): The claim of 'stable recall' across 13 LLMs lacks any description of prompt engineering, instance generation/verification procedure, statistical tests for stability, or controls for verbalization fidelity. Without these, it is impossible to assess whether the reported recall truly measures only Type II errors or whether the oracle implementation introduces the very false-positive subsumptions the framework claims to avoid.

    Authors: We concur that the experimental section requires substantially more methodological detail to allow proper evaluation. We will expand it to document: the exact prompt templates and any prompt-engineering steps employed; the full instance-generation protocol together with verification procedures (including whether generated instances were manually inspected for validity or automatically checked for consistency with the verbalization); the statistical measures or tests applied to assess recall stability across the 13 models; and any controls or fidelity checks performed on the controlled-natural-language verbalizations. These additions will clarify whether the observed recall corresponds to the intended error profile and will be included in the revised manuscript. revision: yes
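
None of the wording below comes from the paper; it is only a hypothetical shape for two of the items the response promises to document, a prompt template and the rule that turns a raw model reply into a satisfiability verdict:

```python
# Hypothetical prompt template and interpretation rule; the paper's actual
# templates, model settings, and verification steps are not reproduced here.
PROMPT_TEMPLATE = (
    "You are assisting in building an OWL ontology.\n"
    "Question: {cnl_question}\n"
    "Reply with exactly one named real-world example, or with the single word 'none'."
)

def interpret_reply(reply: str) -> str:
    """Map a raw model reply to a verdict on the counter-concept (illustrative rule only)."""
    if reply.strip().lower().rstrip(".") == "none":
        return "unsatisfiable"  # accept the candidate axiom
    return "satisfiable"        # reject it, pending whatever validity check is applied

print(PROMPT_TEMPLATE.format(
    cnl_question="Give one real-world example of something that is an apple but not a fruit."
))
print(interpret_reply("None."))  # -> unsatisfiable
```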

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper's central derivation invokes the standard DL reduction of subsumption to satisfiability (C ⊑ D iff C ⊓ ¬D unsatisfiable), an established external result, then treats LLMs as an external oracle for counter-concept instances after verbalization. No equations, parameters, or definitions reduce to themselves by construction; the Type-II-only error claim is presented as a direct consequence of the oracle's placement in the reduction rather than as a self-referential fit or a reliance on self-citation. The approach is evaluated against external ontologies, and none of the enumerated circularity patterns is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard description-logic reduction of subsumption to satisfiability and on the assumption that LLMs can act as approximate oracles for counterexamples; no free parameters or new entities are introduced.

axioms (1)
  • standard math Subsumption can be reduced to satisfiability in description logics
    Invoked explicitly as the inspiration for reformulating candidate axioms into counter-concepts.
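
For reference, the textbook reduction this axiom names, stated with respect to a TBox; it holds in description logics with full concept negation (e.g. ALC and its extensions):

```latex
% Subsumption-to-satisfiability reduction with respect to a TBox \mathcal{T}:
\mathcal{T} \models C \sqsubseteq D
\quad\Longleftrightarrow\quad
C \sqcap \lnot D \ \text{ is unsatisfiable w.r.t. } \mathcal{T}
```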

pith-pipeline@v0.9.0 · 5476 in / 1166 out tokens · 43316 ms · 2026-05-10T08:15:43.517101+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

6 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    In Proceedings of the AAAI Symposium Series, volume 4, 188–198

    Investigating open source LLMs to retrofit competency questions in ontology engineering. In Proceedings of the AAAI Symposium Series, volume 4, 188–198. Alharbi, R.; Tamma, V.; Payne, T. R.; and de Berardinis, J. 2025. A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements. Angluin, D. 1988. Queries and concept learning...

  2. [2]

    Consortium, G

    OWL web ontology language reference. W3C recommendation 10(2):1–53. Consortium, G. O. 2019. The gene ontology resource: 20 years and still going strong. Nucleic acids research 47(D1):D330–D338. Cregan, A.; Schwitter, R.; Meyer, T.; et al. 2007. Sydney OWL syntax - towards a controlled natural language syntax for OWL 1.1. In OWLED, volume 258. Cuenca Grau, B.;...

  3. [3]

    International Joint Conferences on Artificial Intelligence Organization

    Learning description logic concepts: when can positive and negative examples be separated? In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, volume 2019, 1682–1688. International Joint Conferences on Artificial Intelligence Organization. Funk, M.; Jung, J. C.; and Lutz, C. 2021. Actively learning concepts an...

  4. [4]

    Why Language Models Hallucinate

    Why language models hallucinate. arXiv preprint arXiv:2509.04664. Kazakov, Y. 2008. RIQ and SROIQ are harder than SHOIQ. 274–284. AAAI Press. Konev, B.; Lutz, C.; Ozaki, A.; and Wolter, F. 2018. Exact learning of lightweight description logic ontologies. Journal of Machine Learning Research 18(201):1–63. Lehmann, J., and Hitzler, P. 2010. Concept learning in de...

  5. [5]

    In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2025), 1792–1799

    Actively learning EL terminologies from large language models. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2025), 1792–1799. Muggleton, S., and De Raedt, L. 1994. Inductive logic programming: Theory and methods. The Journal of Logic Programming 19:629–679. Ozaki, A. 2025. Actively learning from machine learning mod...

  6. [6]

    Sirin, E.; Parsia, B.; Grau, B

    SNOMED reaching its adolescence: Ontologists' and logicians' health check. International journal of medical informatics 78:S86–S94. Sirin, E.; Parsia, B.; Grau, B. C.; Kalyanpur, A.; and Katz, Y. 2007. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 5(2):51–53. Stehman, S. V. 1997. Selecting and interpreting measures of thematic classification...