When CQs Go Wrong: Challenges in CQ Verification with OE-Assist

Aldo Gangemi; Andrea Giovanni Nuzzolese; Anna Sofia Lippolis; Eva Blomqvist; Mohammad Javad Saeedizade; Robin Keskisärkkä

arxiv: 2606.24619 · v1 · pith:GKQ4ZPRTnew · submitted 2026-06-23 · 💻 cs.AI

Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese

Pith reviewed 2026-06-25 23:35 UTC · model grok-4.3

keywords Competency Questions · Ontology Verification · CQ Verification · LLM Assistant · Ontology Engineering · Ambiguity · Complexity · Refinement Tool

The pith

Competency questions with ambiguities and excessive complexity hinder reliable ontology verification and require a dedicated refinement tool before publication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines what makes competency questions challenging during the verification process that checks whether an ontology meets its intended purpose. Experiments involved 19 participants completing 20 tasks with an LLM-based assistant, revealing that linguistic nuances often produce inconsistent modelling decisions. The results establish that these issues persist even with assistance and point to the need for a tool that refines CQs prior to their use in evaluation. A sympathetic reader would see this as a practical step to reduce errors in later ontology engineering phases.

Core claim

CQ-verification is time-consuming and error-prone because it requires careful interpretation of linguistic nuances and precise alignment with formal ontology constructs; ambiguities and complexity in CQs lead to inconsistent modelling decisions, and the experiments demonstrate the necessity of a tool to refine CQs before publishing them to avoid these problems in the ontology engineering process.

What carries the argument

OE-Assist, the LLM assistant deployed to support participants in CQ-verification tasks, used to surface specific interpretation challenges across the 20 tasks.

If this is right

Refined CQs produce more consistent alignment between natural language questions and formal ontology constructs.
Unrevised CQs increase the likelihood of error-prone and time-consuming verification outcomes.
A pre-publication refinement step reduces ambiguity that otherwise propagates into later ontology engineering phases.
LLM assistance alone does not eliminate the need for prior CQ refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A similar refinement step could apply to other natural-language specifications used in knowledge engineering.
Detection rules for common CQ ambiguities might be added to existing ontology tools to automate part of the process.
Refinement could be tested as a standard checkpoint before any CQ-based evaluation begins.
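
As an illustration of what such detection rules might look like, the Python sketch below flags a few surface patterns in a CQ. The rule names, the regular expressions, and the 20-word length threshold are assumptions made for this example; none of them come from the paper or from OE-Assist.

```python
import re

# Hypothetical rule set (not from the paper): label -> regex over the CQ text.
AMBIGUITY_RULES = {
    "vague_quantifier": re.compile(r"\b(some|many|several|various|most)\b", re.I),
    "conjunction": re.compile(r"\b(and|or)\b", re.I),
    "unresolved_pronoun": re.compile(r"\b(it|they|them|this|that|these|those)\b", re.I),
}
MAX_WORDS = 20  # assumed threshold for "excessive" length


def flag_cq(cq: str) -> list[str]:
    """Return the labels of all heuristic rules that the CQ text triggers."""
    flags = [label for label, pattern in AMBIGUITY_RULES.items() if pattern.search(cq)]
    if len(cq.split()) > MAX_WORDS:
        flags.append("too_long")
    return flags


if __name__ == "__main__":
    for cq in ("Who is the author of a book?",
               "Which sensors and actuators are installed in it?"):
        print(cq, "->", flag_cq(cq))
```

Rules of this kind only catch surface cues. A flagged CQ would still need a human decision on whether it is genuinely ambiguous, and a clean result does not show that the CQ is unambiguous.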

Load-bearing premise

The challenges observed with 19 participants across 20 tasks using OE-Assist are representative of typical CQ verification difficulties and would persist without a dedicated refinement tool.

What would settle it

A controlled comparison in which the same 20 tasks are rerun after systematic refinement of the CQs, measuring whether the rate of inconsistent modelling decisions drops.
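
One way such a comparison could be scored is sketched below in Python, assuming each task is coded 1 if participants reached inconsistent modelling decisions on it and 0 otherwise. The coding scheme, the function names, and the choice of an exact McNemar test for paired outcomes are assumptions of this sketch, and the sample values are placeholders, not data from the study.

```python
from math import comb


def inconsistency_rate(outcomes):
    """Share of tasks coded 1 (inconsistent modelling decisions)."""
    return sum(outcomes) / len(outcomes)


def mcnemar_exact(before, after):
    """Two-sided exact McNemar p-value for paired binary outcomes.

    before[i] and after[i] describe the same task with the original
    and the refined CQ respectively.
    """
    # Discordant pairs: inconsistency resolved vs. introduced by refinement.
    resolved = sum(1 for x, y in zip(before, after) if x and not y)
    introduced = sum(1 for x, y in zip(before, after) if not x and y)
    n = resolved + introduced
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(resolved, introduced)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Placeholder codes for ten hypothetical tasks (NOT study data).
    before = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    after = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    print(inconsistency_rate(before), inconsistency_rate(after))
    print(mcnemar_exact(before, after))
```

A paired test is used here because the same tasks appear in both conditions. With only 20 tasks, even a large drop in the raw rate may not reach conventional significance, so the effect size is worth reporting alongside the p-value.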

Figures

Figures reproduced from arXiv: 2606.24619 by Aldo Gangemi, Andrea Giovanni Nuzzolese, Anna Sofia Lippolis, Eva Blomqvist, Mohammad Javad Saeedizade, Robin Keskisärkkä.

Figure 1: Correlation analysis of CQ completion time, user correctness, ontology size, and the two complexity metrics introduced by Alharbi et al. [2]. The figure additionally includes a box plot of completion time stratified by ontology size and statistics of time spent on each CQ duration and CQ complexity. The results indicate that only one statistically significant association was observed. Specifically, a mod…

Original abstract

Competency Questions (CQs) are the central component of CQ-verification, an established process in which an ontology is evaluated against a set of natural language questions to determine whether the intended purpose of the ontology has been properly modelled. However, CQ-verification is often time-consuming and error-prone, as it requires careful interpretation of linguistic nuances and precise alignment with formal ontology constructs. Ambiguities and complexity in CQs can further complicate this process, leading to inconsistent modelling decisions and verification outcomes. In this paper, we investigate what makes a CQ challenging and possible solutions to enhance the users' performance in the CQ-verification process. We experimented with the data of 19 participants who performed CQ-verification on 20 tasks using an LLM assistant to support ontology evaluation. The results show the necessity of a tool to refine CQs before publishing them to avoid ambiguity or excessive complexity in later phases of the ontology engineering process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Small single-arm study with no metrics or baseline can't support the claim that a CQ refinement tool is necessary.

The paper runs a 19-person study where participants used an LLM assistant called OE-Assist to verify competency questions against ontologies. It reports that people ran into ambiguities and complexity, then concludes a pre-publication refinement tool is needed.

What the work actually does is document some practical friction points that ontology engineers already know exist when CQs are poorly worded. The participant tasks and the assistant are described at a high level, and the authors correctly flag that linguistic issues affect later modeling steps.

The soft spot is the leap from observations to necessity. There is no control condition, no pre/post refinement comparison, no correctness scores, and no statistical checks. With N=19 and 20 tasks in one arm only, you cannot separate problems caused by raw CQs from problems caused by the assistant itself, task choice, or individual differences. The abstract supplies none of the numbers or demographics that would let a reader judge whether the pattern is reliable.

This is the kind of incremental tooling note that might interest a narrow group working on ontology evaluation interfaces. It does not yet give enough evidence to change practice or to justify a dedicated refinement component.

I would not bring it to reading group and would not cite it. A serious editor should desk-reject rather than send it out, because the central inference rests on an unsupported design.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical user study in which 19 participants performed CQ-verification on 20 tasks using the OE-Assist LLM assistant. It identifies challenges arising from ambiguities and excessive complexity in natural-language CQs and concludes that these results demonstrate the necessity of a dedicated pre-publication CQ refinement tool to improve subsequent ontology engineering phases.

Significance. If the central claim were supported by appropriate controls and metrics, the work would provide a useful empirical illustration of practical difficulties in CQ-based ontology verification and could motivate tool development in the ontology engineering community. The study design incorporates real participants interacting with an LLM assistant, which supplies a concrete, practice-oriented data point.

major comments (2)

[Abstract] The single-condition design (19 participants, 20 tasks, OE-Assist only) supplies no baseline arm (e.g., pre-refined CQs or no assistant) and reports no quantitative correctness metrics, statistical tests, or raw data. Consequently the observed errors cannot be attributed specifically to unrefined CQs, which is load-bearing for the necessity claim.
[Experimental setup / Results] Without a control condition or within-subject comparison, the data cannot test whether the reported ambiguities and complexity would be reliably reduced by a refinement tool, leaving the causal inference that such a tool is necessary unsupported.

minor comments (1)

[Abstract] Participant demographics, task selection criteria, and exact performance measures are omitted, reducing the reader's ability to assess representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's insightful comments. We recognize the limitations of our single-condition study design and will revise the manuscript to clarify the exploratory nature of the work, qualify our conclusions regarding the necessity of a CQ refinement tool, and add a limitations section to address the lack of baseline comparisons and quantitative metrics.

Referee: [Abstract] The single-condition design (19 participants, 20 tasks, OE-Assist only) supplies no baseline arm (e.g., pre-refined CQs or no assistant) and reports no quantitative correctness metrics, statistical tests, or raw data. Consequently the observed errors cannot be attributed specifically to unrefined CQs, which is load-bearing for the necessity claim.

Authors: The referee correctly identifies that our study employs a single-condition design without a baseline. This was intentional, as the goal was to investigate challenges in CQ verification as it is currently practiced with LLM assistance, rather than to evaluate the impact of refinement. The analysis was qualitative, focusing on participant feedback and observed issues, which explains the absence of quantitative metrics and statistical tests. We will revise the abstract to better describe the study as exploratory and to moderate the claim about demonstrating necessity, instead highlighting the observed challenges as motivation for tool development. Raw data can be made available upon request in a revision if it aids transparency.

revision: partial

Referee: [Experimental setup / Results] Without a control condition or within-subject comparison, the data cannot test whether the reported ambiguities and complexity would be reliably reduced by a refinement tool, leaving the causal inference that such a tool is necessary unsupported.

Authors: We concur that the current data does not support a causal claim about the effectiveness of a refinement tool, as no comparison was made. The recommendation for such a tool stems from the identification of specific ambiguities and complexities that hinder verification, suggesting that preemptive refinement could mitigate these. In the revised manuscript, we will reframe the conclusions to present this as a motivated recommendation for future research and tool building, rather than an empirically proven necessity. We will explicitly state in a new limitations paragraph that controlled experiments are needed to confirm the benefits of CQ refinement tools.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical user study with observational inference

The paper reports results from a single-arm user study (19 participants, 20 tasks) using OE-Assist and draws the necessity claim directly from observed ambiguities and errors in that data. No equations, parameters, derivations, or self-citations are present that reduce any result to its own inputs by construction. The inference chain is data collection to qualitative observation, which is self-contained and externally falsifiable via replication with different participants or baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical user study containing no mathematical model, free parameters, axioms, or postulated entities.

Reference graph

Works this paper leans on

19 extracted references

[1] Rebboud, Y. and others. ESWC.
[2] A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements. arXiv preprint arXiv:2507.02989.
[3] Ciroku, F. and others. arXiv preprint arXiv:240X.XXXXX.
[4] Pan, X. and others. arXiv preprint arXiv:240X.XXXXX.
[5] Anonymous. ISWC Workshop.
[6] McNamara, J. and others. Semantic Web Journal.
[7] Di Nuzzo, G. and others. arXiv preprint arXiv:240X.XXXXX.
[8] Taghzouti, A. and others. ESWC.
[9] Large Language Models Assisting Ontology Evaluation. International Semantic Web Conference, 2025.
[10] Alharbi, R. and others. Knowledge Engineering Review.
[11] Anonymous. ISWC.
[12] Ontology generation using large language models. European Semantic Web Conference, 2025.
[13] Large Language Models as Assistants for Ontology Engineering.
[14] Navigating ontology development with large language models. European Semantic Web Conference, 2024.
[15] On the roles of competency questions in ontology engineering. International Conference on Knowledge Engineering and Knowledge Management, 2024.
[16] Discerning and characterising types of competency questions for ontologies. arXiv preprint arXiv:2412.13688.
[17] A review and comparison of competency question engineering approaches. International Conference on Knowledge Engineering and Knowledge Management, 2024.
[18] Bench4KE: Benchmarking Automated Competency Question Generation. arXiv preprint arXiv:2505.24554.
[19] Ontology testing-methodology and tool. International Conference on Knowledge Engineering and Knowledge Management, 2012.
