pith. machine review for the scientific record.

arxiv: 2604.16672 · v1 · submitted 2026-04-17 · 💻 cs.AI

Recognition: unknown

From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies

David Flynn, Duncan Guthrie, Haoruo Zhao, Michele Sevegnani, Paul Harvey, Wenshuo Tang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords errors · llms · subsumption · active · corresponding · counter-concept · language · learning

The pith

LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, queries models for counter-concept examples, and ensures errors are only Type II delays rather than inconsistencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Active learning builds ontologies by asking yes-no questions about whether one concept is a subclass of another. The authors turn each such question into a satisfiability problem by constructing a counter-concept that would have real-world examples only if the answer were no. They then ask an LLM to name real-world things that would be examples of that counter-concept. If the LLM is wrong, the worst outcome is missing some needed axioms, which only slows construction; it cannot add axioms that contradict the rest of the ontology. Tests across 13 commercial LLMs found recall stayed stable on several standard ontologies.
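
A minimal sketch of that loop, under the simplest reading of the design; `ask_llm_for_instance` is a hypothetical stand-in for the verbalised query the paper sends to each model, not the authors' implementation:

```python
from typing import Callable, Optional

def counter_concept(sub: str, sup: str) -> str:
    # The candidate axiom "sub ⊑ sup" holds exactly when "sub ⊓ ¬sup" is unsatisfiable.
    return f"{sub} ⊓ ¬{sup}"

def decide_axiom(sub: str, sup: str,
                 ask_llm_for_instance: Callable[[str], Optional[str]]) -> bool:
    """Return True to add the candidate axiom sub ⊑ sup, False to withhold it."""
    example = ask_llm_for_instance(counter_concept(sub, sup))
    if example is not None:
        # The model named something it claims is a `sub` that is not a `sup`,
        # so the subsumption is rejected and no axiom is added. If the example
        # was hallucinated, the cost is a needed axiom arriving late (the paper's
        # Type II delay), never a contradictory axiom entering the ontology.
        return False
    # No example offered: treat the counter-concept as unsatisfiable and add the axiom.
    return True

# Toy usage with a hard-coded stand-in "model" that knows one counterexample:
toy_llm = lambda query: "a penguin" if "Bird ⊓ ¬FlyingThing" in query else None
assert decide_axiom("Apple", "Fruit", toy_llm) is True        # no counterexample found
assert decide_axiom("Bird", "FlyingThing", toy_llm) is False  # counterexample named
```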

Core claim

This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall remains stable across several well-established ontologies.

Load-bearing premise

That verbalizing counter-concepts in controlled natural language preserves enough logical meaning for LLMs to generate useful and unbiased example instances, and that the reduction from subsumption to satisfiability remains sound when the oracle is an LLM rather than a perfect reasoner.
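
A sketch of what such a verbalization could look like; the template wording below is invented for illustration and is not the paper's controlled natural language:

```python
def verbalize_counter_concept(sub: str, sup: str) -> str:
    # Turn the counter-concept sub ⊓ ¬sup of the candidate axiom sub ⊑ sup
    # into a natural-language request for a witness. Illustrative template only.
    return (
        f"Give one real-world example of something that is an instance of "
        f"'{sub}' but is not an instance of '{sup}'. "
        f"If no such example can exist, answer 'none'."
    )

# For the abstract's membership query "Is every apple a fruit?" (Apple ⊑ Fruit):
print(verbalize_counter_concept("Apple", "Fruit"))
```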

Figures

Figures reproduced from arXiv: 2604.16672 by David Flynn, Duncan Guthrie, Haoruo Zhao, Michele Sevegnani, Paul Harvey, Wenshuo Tang.

Figure 1: An LLM-assisted Active Learning Setting
Figure 2: LLM Fuzzing Method
Figure 3: Confusion Matrix for Active Learning
Figure 4: LLM-Generated Real-World Example for the Counter-Concept
Original abstract

In active learning, membership queries (MQs) allow a learner to pose questions to a teacher, such as ''Is every apple a fruit?'', to which the teacher responds correctly with yes or no. These MQs can be viewed as subsumption tests with respect to the target ontology. Inspired by the standard reduction of subsumption to satisfiability in description logics, we reformulate each candidate axiom into its corresponding counter-concept and verbalise it in controlled natural language before presenting it to Large Language Models (LLMs). We introduce LLMs as a third component that provides real-world examples approximating an instance of the counter-concept. This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall, corresponding to Type II errors in our framework, remains stable across several well-established ontologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes reformulating membership queries in active learning for OWL ontologies as subsumption tests, reduced to satisfiability via counter-concepts (C ⊓ ¬D). These are verbalized in controlled natural language and presented to LLMs, which act as oracles by attempting to generate real-world instances. The central claim is that this design ensures only Type II errors (delaying axiom addition) can occur, with no risk of introducing inconsistencies; experiments across 13 commercial LLMs report stable recall on several established ontologies.

Significance. If the Type-II-only error guarantee were rigorously established, the approach would offer a low-risk method for LLM-assisted ontology construction, addressing a key barrier in scaling knowledge engineering. The use of the standard DL reduction combined with LLMs as an external oracle is a clear strength, and the reported stability of recall across multiple models and ontologies provides initial evidence of practical utility. However, the significance is limited by the need to substantiate the safety property against LLM incompleteness.
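
To make the disputed error profile concrete, the four possible oracle outcomes under the reduction can be tabulated as follows; the labels follow the referee's reading and the paper's Type I/Type II terminology, and the function is an illustration rather than the paper's own analysis:

```python
def outcome(counter_concept_satisfiable: bool, llm_offered_instance: bool) -> str:
    """Classify one oracle decision under the subsumption-to-satisfiability reduction.

    counter_concept_satisfiable: ground truth, whether C ⊓ ¬D has an instance
        (i.e. the candidate axiom C ⊑ D is actually false).
    llm_offered_instance: whether the model offered something it claims
        instantiates the counter-concept (possibly a hallucination).
    """
    if counter_concept_satisfiable and llm_offered_instance:
        return "correct rejection: a false axiom is kept out"
    if counter_concept_satisfiable and not llm_offered_instance:
        # Incomplete coverage: a false axiom is wrongly added.
        return "Type I error: the failure mode the referee says the abstract cannot rule out"
    if not counter_concept_satisfiable and llm_offered_instance:
        # Hallucinated witness: a true axiom is withheld for now.
        return "Type II error: construction is delayed, no inconsistency is introduced"
    return "correct acceptance: a true axiom is added"
```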

major comments (2)
  1. [Abstract] Abstract: The assertion that the design 'ensures that only Type II errors may occur... without introducing inconsistencies' does not follow from the reduction. Absence of an LLM-generated instance for the counter-concept leads to accepting the subsumption C ⊑ D; because LLMs have incomplete coverage and verbalization may obscure edge cases, a satisfiable counter-concept can be misclassified as unsatisfiable, wrongly adding the axiom (a Type I error). This directly contradicts the safety claim and is load-bearing for the paper's main contribution.
  2. [Experimental results] Experimental results section (referenced in abstract): The claim of 'stable recall' across 13 LLMs lacks any description of prompt engineering, instance generation/verification procedure, statistical tests for stability, or controls for verbalization fidelity. Without these, it is impossible to assess whether the reported recall truly measures only Type II errors or whether the oracle implementation introduces the very false-positive subsumptions the framework claims to avoid.
minor comments (2)
  1. [Abstract] The abstract is overly dense; expanding the description of how counter-concepts are verbalized and how LLM responses are interpreted as satisfiability verdicts would improve readability.
  2. No discussion of related work on LLM reliability for logical reasoning or prior active learning frameworks in description logics is visible in the provided abstract; adding targeted citations would strengthen context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our core safety claim and experimental methodology. We address each major comment below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that the design 'ensures that only Type II errors may occur... without introducing inconsistencies' does not follow from the reduction. Absence of an LLM-generated instance for the counter-concept leads to accepting the subsumption C ⊑ D; because LLMs have incomplete coverage and verbalization may obscure edge cases, a satisfiable counter-concept can be misclassified as unsatisfiable, wrongly adding the axiom (a Type I error). This directly contradicts the safety claim and is load-bearing for the paper's main contribution.

    Authors: We agree that the original abstract wording overstates the guarantee. The framework queries LLMs for real-world instances of the counter-concept C ⊓ ¬D; generation of a valid instance rejects the subsumption (preventing addition of a false axiom), while absence leads to acceptance. Hallucinated instances therefore cause only Type II errors (delayed axioms), but, as the referee notes, LLM incompleteness can cause missed counterexamples and thus wrongly added axioms, i.e. Type I errors. We will revise the abstract to qualify the claim explicitly, stating that the approach reduces inconsistency risk relative to direct membership queries and that the Type-II-only property holds only when the LLM never overlooks an existing counterexample, and we will add a short discussion of this completeness assumption and its implications for the safety property. revision: yes

  2. Referee: [Experimental results] Experimental results section (referenced in abstract): The claim of 'stable recall' across 13 LLMs lacks any description of prompt engineering, instance generation/verification procedure, statistical tests for stability, or controls for verbalization fidelity. Without these, it is impossible to assess whether the reported recall truly measures only Type II errors or whether the oracle implementation introduces the very false-positive subsumptions the framework claims to avoid.

    Authors: We concur that the experimental section requires substantially more methodological detail to allow proper evaluation. We will expand it to document: the exact prompt templates and any prompt-engineering steps employed; the full instance-generation protocol together with verification procedures (including whether generated instances were manually inspected for validity or automatically checked for consistency with the verbalization); the statistical measures or tests applied to assess recall stability across the 13 models; and any controls or fidelity checks performed on the controlled-natural-language verbalizations. These additions will clarify whether the observed recall corresponds to the intended error profile and will be included in the revised manuscript. revision: yes
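
None of the wording below comes from the paper; it is only a hypothetical shape for two of the items the response promises to document, a prompt template and the rule that turns a raw model reply into a satisfiability verdict:

```python
# Hypothetical prompt template and interpretation rule; the paper's actual
# templates, model settings, and verification steps are not reproduced here.
PROMPT_TEMPLATE = (
    "You are assisting in building an OWL ontology.\n"
    "Question: {cnl_question}\n"
    "Reply with exactly one named real-world example, or with the single word 'none'."
)

def interpret_reply(reply: str) -> str:
    """Map a raw model reply to a verdict on the counter-concept (illustrative rule only)."""
    if reply.strip().lower().rstrip(".") == "none":
        return "unsatisfiable"  # accept the candidate axiom
    return "satisfiable"        # reject it, pending whatever validity check is applied

print(PROMPT_TEMPLATE.format(
    cnl_question="Give one real-world example of something that is an apple but not a fruit."
))
print(interpret_reply("None."))  # -> unsatisfiable
```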

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper's central derivation invokes the standard DL reduction of subsumption to satisfiability (C ⊑ D iff C ⊓ ¬D unsatisfiable), an established external result, then treats LLMs as an external oracle for counter-concept instances after verbalization. No equations, parameters, or definitions reduce to themselves by construction; the Type-II-only error claim is presented as a direct consequence of the oracle's placement in the reduction rather than as a self-referential fit or a reliance on self-citation. The approach is evaluated against external ontologies, and none of the enumerated circularity patterns is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard description-logic reduction of subsumption to satisfiability and on the assumption that LLMs can act as approximate oracles for counterexamples; no free parameters or new entities are introduced.

axioms (1)
  • standard math Subsumption can be reduced to satisfiability in description logics
    Invoked explicitly as the inspiration for reformulating candidate axioms into counter-concepts.
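
For reference, the textbook reduction this axiom names, stated with respect to a TBox; it holds in description logics with full concept negation (e.g. ALC and its extensions):

```latex
% Subsumption-to-satisfiability reduction with respect to a TBox \mathcal{T}:
\mathcal{T} \models C \sqsubseteq D
\quad\Longleftrightarrow\quad
C \sqcap \lnot D \ \text{ is unsatisfiable w.r.t. } \mathcal{T}
```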

pith-pipeline@v0.9.0 · 5476 in / 1166 out tokens · 43316 ms · 2026-05-10T08:15:43.517101+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

6 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    In Proceedings of the AAAI Symposium Series, volume 4, 188–198

    Investigating open source LLMs to retrofit competency questions in ontology engineering. In Proceedings of the AAAI Symposium Series, volume 4, 188–198. Alharbi, R.; Tamma, V.; Payne, T. R.; and de Berardinis, J. 2025. A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements. Angluin, D. 1988. Queries and concept learning...

  2. [2]

    Consortium, G

    OWL web ontology language reference. W3C recommendation 10(2):1–53. Consortium, G. O. 2019. The gene ontology resource: 20 years and still going strong. Nucleic acids research 47(D1):D330–D338. Cregan, A.; Schwitter, R.; Meyer, T.; et al. 2007. Sydney OWL syntax - towards a controlled natural language syntax for OWL 1.1. In OWLED, volume 258. Cuenca Grau, B.;...

  3. [3]

    International Joint Conferences on Artificial Intelligence Organization

    Learning description logic concepts: when can positive and negative examples be separated? In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, volume 2019, 1682–1688. International Joint Conferences on Artificial Intelligence Organization. Funk, M.; Jung, J. C.; and Lutz, C. 2021. Actively learning concepts an...

  4. [4]

    Why Language Models Hallucinate

    Why language models hallucinate. arXiv preprint arXiv:2509.04664. Kazakov, Y. 2008. RIQ and SROIQ are harder than SHOIQ. 274–284. AAAI Press. Konev, B.; Lutz, C.; Ozaki, A.; and Wolter, F. 2018. Exact learning of lightweight description logic ontologies. Journal of Machine Learning Research 18(201):1–63. Lehmann, J., and Hitzler, P. 2010. Concept learning in de...

  5. [5]

    In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2025), 1792–1799

    Actively learning EL terminologies from large language models. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2025), 1792–1799. Muggleton, S., and De Raedt, L. 1994. Inductive logic programming: Theory and methods. The Journal of Logic Programming 19:629–679. Ozaki, A. 2025. Actively learning from machine learning mod...

  6. [6]

    Sirin, E.; Parsia, B.; Grau, B

    SNOMED reaching its adolescence: Ontologists' and logicians' health check. International journal of medical informatics 78:S86–S94. Sirin, E.; Parsia, B.; Grau, B. C.; Kalyanpur, A.; and Katz, Y. 2007. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 5(2):51–53. Stehman, S. V. 1997. Selecting and interpreting measures of thematic classification...