Recognition: no theorem link
Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment
Pith reviewed 2026-05-12 01:50 UTC · model grok-4.3
The pith
Stable 1-to-1 matching is the dominant factor in ontology alignment quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that stable 1-to-1 matching is the dominant factor in ontology alignment quality. On the Anatomy track benchmark it achieves F1 = 0.832 with precision 0.963 and recall 0.733, competitive with state-of-the-art systems and highest in precision. Ablation across five weight configurations shows signal weights are irrelevant once stable matching is applied, with F1 varying by less than 0.004, while removing stable matching drops F1 to 0.728. The same method yields F1 = 0.438 on the Conference track benchmark. Structured tool access via the Model Context Protocol enables the language model to reach F1 = 0.717, compared with 0.323 when reading raw OWL files and 0.431 with no file at all.
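The reported headline numbers are internally consistent: F1 is the harmonic mean of precision and recall, so the Anatomy-track figure can be checked directly from the stated P and R.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported Anatomy-track values: P = 0.963, R = 0.733
print(round(f1(0.963, 0.733), 3))  # → 0.832, matching the reported F1
```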
What carries the argument
Stable 1-to-1 matching procedure that selects an optimal set of conflict-free correspondences from similarity signals between ontology concepts.
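A minimal sketch of the idea (not the paper's implementation): given a similarity matrix between source and target concepts, greedily accept the highest-scoring remaining pair and discard any candidate that conflicts with an accepted one. When preferences on both sides derive from a shared score matrix, this greedy pass yields a stable, conflict-free 1-to-1 correspondence set. The concept names and threshold below are illustrative.

```python
def stable_one_to_one(sim: dict[tuple[str, str], float],
                      threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Greedy stable 1-to-1 matching over a similarity matrix.

    Pairs are visited in descending score order; a pair is accepted
    only if neither concept is already matched, so each concept
    appears in at most one correspondence.
    """
    matched_src: set[str] = set()
    matched_tgt: set[str] = set()
    result = []
    for (s, t), score in sorted(sim.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            break  # remaining pairs are all below threshold
        if s not in matched_src and t not in matched_tgt:
            result.append((s, t, score))
            matched_src.add(s)
            matched_tgt.add(t)
    return result

# Illustrative similarity signals between two small anatomy ontologies
sim = {("mouse:Heart", "human:Heart"): 0.95,
       ("mouse:Heart", "human:Aorta"): 0.60,
       ("mouse:Aorta", "human:Aorta"): 0.90}
print(stable_one_to_one(sim))
# Heart↔Heart and Aorta↔Aorta are kept; the conflicting 0.60 pair is discarded
```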
Load-bearing premise
The performance advantages of stable matching observed on the tested benchmark tracks will hold for other ontology domains, larger structures, and different language model configurations.
What would settle it
Applying the same alignment method to a new pair of ontologies outside the original benchmark tracks and finding that stable matching no longer produces competitive F1 scores or that signal weights regain influence would disprove the central claim.
Original abstract
We present Open Ontologies, an open-source ontology engineering system implemented in Rust that integrates LLM-driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1-to-1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state-of-the-art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool-augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Open Ontologies, an open-source Rust system integrating LLM-driven ontology construction, formal OWL reasoning, and ontology alignment via the Model Context Protocol. Its central empirical claim is that stable 1-to-1 matching dominates alignment quality: on the OAEI Anatomy track it yields F1 = 0.832 (P = 0.963, R = 0.733), competitive with SOTA and highest in precision; an ablation across five weight configurations shows weights are irrelevant under stable matching (F1 varies < 0.004) while removing it drops F1 to 0.728. The same method obtains F1 = 0.438 on the Conference track. A secondary finding is that structured MCP tool access (F1 = 0.717) outperforms both raw OWL file reading (F1 = 0.323) and no-file baselines (F1 = 0.431) for LLMs.
Significance. If the reported dominance of stable matching holds under broader testing, the work would provide a clear, actionable principle for ontology alignment pipelines and demonstrate that structured tool interfaces offer a qualitatively different capability than raw syntax access for LLMs. The single-binary MIT-licensed implementation together with direct evaluation on public OAEI benchmarks constitutes a concrete reproducibility asset for the community.
major comments (2)
- [Results / ablation experiments] The claim that stable 1-to-1 matching is the dominant factor in alignment quality rests on the Anatomy ablation (F1 drop from 0.832 to 0.728 when removed; weights irrelevant). No corresponding weight-configuration or removal ablation is reported for the Conference track (only the single F1 = 0.438 figure is given), so the generality of the dominance conclusion across ontology alignment tasks remains untested within the manuscript.
- [Tool-augmented ontology interaction experiments] The tool-augmentation results (raw OWL F1 = 0.323 worse than no-file F1 = 0.431, MCP F1 = 0.717) are presented as a key qualitative finding, yet the manuscript provides no details on prompt templates, exact MCP tool signatures, number of LLM calls, or statistical significance of the differences; these omissions make it difficult to assess whether the observed ordering is robust to implementation choices.
minor comments (1)
- [Abstract and §4] The abstract and results sections would benefit from explicit statement of the exact OAEI reference alignments used, data-split protocol, and whether the reported F1 figures are averaged over multiple LLM runs or single-shot.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims and reproducibility.
Point-by-point responses
-
Referee: [Results / ablation experiments] The claim that stable 1-to-1 matching is the dominant factor in alignment quality rests on the Anatomy ablation (F1 drop from 0.832 to 0.728 when removed; weights irrelevant). No corresponding weight-configuration or removal ablation is reported for the Conference track (only the single F1 = 0.438 figure is given), so the generality of the dominance conclusion across ontology alignment tasks remains untested within the manuscript.
Authors: We agree that the ablation study was performed only on the Anatomy track and that the Conference track result is reported without the corresponding weight and removal ablations. This does limit the strength of the generality claim in the current manuscript. In the revised version we will add the full set of weight-configuration and removal ablations for the Conference track, using the same five weight settings and removal condition as in Anatomy, so that readers can directly assess whether the dominance of stable matching holds across both OAEI tasks. revision: yes
-
Referee: [Tool-augmented ontology interaction experiments] The tool-augmentation results (raw OWL F1 = 0.323 worse than no-file F1 = 0.431, MCP F1 = 0.717) are presented as a key qualitative finding, yet the manuscript provides no details on prompt templates, exact MCP tool signatures, number of LLM calls, or statistical significance of the differences; these omissions make it difficult to assess whether the observed ordering is robust to implementation choices.
Authors: We acknowledge that the current manuscript lacks the implementation details needed to evaluate robustness. In the revision we will add: (1) the exact prompt templates used for each of the three conditions, (2) the full MCP tool signatures and their parameter descriptions, (3) the number of LLM calls per condition, and (4) statistical support for the F1 differences (standard deviation across at least five independent runs with the same model and temperature). These additions will allow readers to reproduce and assess the ordering. revision: yes
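To make the distinction at stake concrete: the paper does not publish its MCP tool signatures, but the gap between "raw OWL file" and "structured tool access" can be illustrated with a hypothetical lookup tool. Instead of handing the model serialized OWL/XML to parse, a tool returns exactly the structured facts relevant to a query. The tool name, IDs, and schema below are invented for illustration.

```python
# Hypothetical in-memory ontology index (illustrative IDs and labels)
ontology = {
    "NCI_C12727": {"label": "Heart", "parents": ["NCI_C12919"]},
    "NCI_C12919": {"label": "Organ", "parents": []},
}

def search_labels(query: str) -> list[dict]:
    """Structured lookup: return concepts whose label contains `query`.

    An MCP-style tool exposes calls like this, so the model receives
    targeted structured records rather than raw OWL syntax.
    """
    return [{"id": cid, **concept}
            for cid, concept in ontology.items()
            if query.lower() in concept["label"].lower()]

print(search_labels("heart"))
# → [{'id': 'NCI_C12727', 'label': 'Heart', 'parents': ['NCI_C12919']}]
```

The ordering reported in the paper (MCP 0.717 > no file 0.431 > raw OWL 0.323) suggests that this kind of targeted retrieval, not merely having the data in context, is what the model exploits.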
Circularity Check
No circularity: empirical results on public benchmarks
full rationale
The paper's central claims consist of measured F1 scores and ablation outcomes on the public OAEI Anatomy and Conference tracks. Stable-matching dominance is shown by direct experimental comparison (F1 0.832 with matching vs. 0.728 without; weight variations <0.004), not by any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's own inputs; all reported quantities are externally verifiable against fixed reference alignments.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] OWL 2 semantics and reasoning are sound and complete for the tested ontologies
- [domain assumption] OAEI Anatomy and Conference tracks are representative alignment tasks
Reference graph
Works this paper leans on
- [1] Chen, B., Lin, Z., Li, Y., Zhang, N., Wen, H., Chen, H.: OntoAxiom: Evaluating LLM Capabilities for Ontology Axiom Identification. arXiv:2512.05594 (2025)
- [2] Glimm, B., Horrocks, I., Motik, B., Stoilos, G., Wang, Z.: HermiT: An OWL 2 Reasoner. J. Automated Reasoning 53(3), 245–269 (2014)
- [3] Algergawy, A., Faria, D., Ferrara, A., Jiménez-Ruiz, E., et al.: Results of the Ontology Alignment Evaluation Initiative 2024. In: OM Workshop at ISWC 2024. CEUR-WS Vol-3897 (2024)
- [4] Jiménez-Ruiz, E., Grau, B.C.: LogMap: Logic-Based and Scalable Ontology Matching. In: ISWC 2011, pp. 273–288. Springer (2011)
- [5] Faria, D., Pesquita, C., Santos, E., Palmonari, M., Cruz, I.F., Couto, F.M.: The AgreementMakerLight Ontology Matching System. In: OTM 2013, pp. 527–541. Springer (2013)
- [6] He, Y., Chen, J., Antonyrajah, D., Horrocks, I.: BERTMap: A BERT-Based Ontology Alignment System. In: AAAI 2022, pp. 5570–5578 (2022)
- [7] Hertling, S., Paulheim, H.: OLaLa: Ontology Matching with Large Language Models. In: ISWC 2023, pp. 442–459. Springer (2023)
- [8] Anthropic: Model Context Protocol Specification. https://modelcontextprotocol.io (2024)
- [9] Tanière, T.: Oxigraph: A SPARQL Graph Database. https://oxigraph.org (2023)
- [10] Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A Practical OWL-DL Reasoner. J. Web Semantics 5(2), 51–53 (2007)
- [11] Tsarkov, D., Horrocks, I.: FaCT++ Description Logic Reasoner: System Description. In: IJCAR 2006, pp. 292–297. Springer (2006)
- [12] Kazakov, Y., Krötzsch, M., Simančík, F.: The Incredible ELK. J. Automated Reasoning 53(1), 1–61 (2014)
- [13] Caufield, J.H., et al.: Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES). Bioinformatics 40(3), btae104 (2024)
- [14] Graux, D., et al.: OntoChat: A Framework for Conversational Ontology Engineering. In: ESWC 2024 (2024)
- [15] Babaei Giglou, H., D'Souza, J., Auer, S.: LLMs4OL: Large Language Models for Ontology Learning. In: ISWC 2023, pp. 408–427. Springer (2023)
- [16] Horridge, M., Drummond, N., Goodwin, J., Rector, A., Stevens, R., Wang, H.: The Manchester OWL Syntax. In: OWL Experiences and Directions Workshop (2006)
- [17] Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base Systems. J. Web Semantics 3(2–3), 158–182 (2005)
- [18] HashiCorp: Terraform: Infrastructure as Code. https://www.terraform.io (2014)
- [19] Horridge, M., Bechhofer, S.: The OWL API: A Java API for OWL Ontologies. Semantic Web 2(1), 11–21 (2011)
- [20] Jackson, R.C., et al.: ROBOT: A Tool for Automating Ontology Workflows. BMC Bioinformatics 20, 407 (2019)
- [21] Lamy, J.B.: Owlready: Ontology-Oriented Programming in Python. Artificial Intelligence in Medicine 80, 11–28 (2017)
- [22] Matentzoglu, N., et al.: A Simple Standard for Sharing Ontological Mappings (SSSOM). Database 2022, baac035 (2022)
- [23] Motik, B., Grau, B.C., Horrocks, I., et al.: OWL 2 Web Ontology Language Profiles (Second Edition). W3C Recommendation (2012)
discussion (0)