arxiv: 2605.10540 · v1 · submitted 2026-05-11 · 💻 cs.DB

Keeping track of errors: A study of SHACL-DS for RDF dataset validation on the ERA RINF Knowledge Graph

Christophe Debruyne, Davan Chiem Dao, Ghislain Atemezing

Pith reviewed 2026-05-12 03:36 UTC · model grok-4.3

keywords: SHACL-DS · RDF dataset validation · named graphs · knowledge graph validation · SHACL · validation performance · ERA RINF

The pith

SHACL-DS validates large RDF datasets with named graphs faster than standard SHACL while matching its results and adding provenance tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests SHACL-DS, which extends SHACL to target named graphs and graph combinations directly in RDF datasets, on the ERA RINF knowledge graph, in which 56 infrastructure managers contribute data to dedicated named graphs. Two migration strategies convert the existing SHACL shapes into SHACL-DS form, and both are run against the full dataset using a TopBraid SHACL-DS implementation developed for the study. Both strategies return the same validation outcomes as the baseline SHACL approach, which flattens every named graph into one data graph, yet they finish in less time. The work shows that SHACL-DS can handle a real, large-scale knowledge graph while declaring validation scope inside the shapes artefact, enforcing triple provenance through GRAPH clauses, and annotating reports with per-graph detail.
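The contrast between the flattened baseline and dataset-level validation can be illustrated with a toy sketch. It is not the authors' implementation and does not use SHACL at all: the quads, the IRIs, and the single "every track needs a label" check are all hypothetical, chosen only to show where graph-level detail is lost or kept.

```python
from collections import defaultdict

# Toy quads: (graph, subject, predicate, object). All IRIs are made up.
QUADS = [
    ("g:im-a", "ex:track1", "rdf:type", "ex:Track"),
    ("g:im-a", "ex:track1", "ex:label", "T1"),
    ("g:im-b", "ex:track2", "rdf:type", "ex:Track"),  # no label: a violation
]


def missing_label(triples):
    """Return the sorted ex:Track subjects that lack an ex:label."""
    tracks = {s for s, p, o in triples if p == "rdf:type" and o == "ex:Track"}
    labelled = {s for s, p, o in triples if p == "ex:label"}
    return sorted(tracks - labelled)


def validate_flattened(quads):
    """Baseline: merge every named graph into one data graph, then validate."""
    merged = {(s, p, o) for _, s, p, o in quads}
    return [{"focus": node} for node in missing_label(merged)]


def validate_per_graph(quads):
    """Dataset-level: validate each named graph separately and record it."""
    by_graph = defaultdict(set)
    for g, s, p, o in quads:
        by_graph[g].add((s, p, o))
    return [
        {"focus": node, "graph": g}
        for g in sorted(by_graph)
        for node in missing_label(by_graph[g])
    ]
```

In this toy both functions flag the same focus node, but only the per-graph variant can say which contributor's graph it came from. The two approaches agree here because the check never depends on triples spread across graphs; that is a property of the example, not a general guarantee.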

Core claim

SHACL-DS is at least as expressive as SHACL when applied to the ERA RINF KG. Two migration strategies produce identical validation results to the baseline yet run faster. SHACL-DS lets the validation scope be declared inside the shapes artefact itself, enforces triple provenance through GRAPH clauses, enriches validation reports with per-graph annotations, and supports shape organisation across named shapes graphs.

What carries the argument

SHACL-DS, the extension of SHACL that adds declarative targeting of named graphs and combinations of graphs for dataset-level validation.
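One way to picture "combinations of graphs" is as set expressions over named graphs that are resolved into a single target before validation. The sketch below assumes that reading; the operator names (`union`, `intersection`, `minus`), the tuple encoding, and the sample dataset are illustrative and are not taken from the SHACL-DS vocabulary.

```python
# Hypothetical dataset: graph name -> set of (subject, predicate, object).
DATASET = {
    "g:a": {("ex:s1", "ex:p", "1"), ("ex:s2", "ex:p", "2")},
    "g:b": {("ex:s2", "ex:p", "2"), ("ex:s3", "ex:p", "3")},
}


def resolve_target(expr, dataset):
    """Resolve a target expression into the set of triples to validate.

    expr is either a graph name (str) or a tuple (operator, left, right),
    where operator is "union", "intersection", or "minus".
    """
    if isinstance(expr, str):
        # An unknown graph name resolves to an empty graph.
        return set(dataset.get(expr, set()))
    op, left, right = expr
    lhs = resolve_target(left, dataset)
    rhs = resolve_target(right, dataset)
    if op == "union":
        return lhs | rhs
    if op == "intersection":
        return lhs & rhs
    if op == "minus":
        return lhs - rhs
    raise ValueError(f"unknown operator: {op!r}")
```

For example, `resolve_target(("minus", "g:a", "g:b"), DATASET)` keeps only the triples that appear in `g:a` and not in `g:b`. The point of the declarative form is that such an expression lives next to the shape rather than in the code that invokes the validator.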

If this is right

Validation outcomes stay identical after migration from SHACL to SHACL-DS.
Execution time improves over the flattened single-graph SHACL baseline.
Validation scope can be stated inside the shapes file without external code.
Reports carry explicit annotations for each named graph.
Shapes can be stored and referenced across multiple named graphs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organisations that combine data from many independent sources could keep each source in its own named graph while still running unified validation.
Performance gains observed here might appear in other multi-contributor knowledge graphs that already separate data by provenance.
Future implementations of SHACL could incorporate similar native graph targeting to reduce the need for custom flattening steps.

Load-bearing premise

The TopBraid SHACL-DS implementation follows the specification exactly and the two migration strategies keep every original constraint without adding or dropping any.

What would settle it

Running the identical ERA RINF shapes and data through a second, independent SHACL-DS engine and obtaining different error counts or different per-graph annotations would show that the results do not generalise.
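A cross-engine check of this kind could be scripted roughly as follows. The sketch assumes each engine's report has already been parsed into a list of dictionaries with `graph`, `focus`, and `shape` keys; that record layout is a hypothetical simplification, not the SHACL validation report format.

```python
from collections import Counter


def result_key(result):
    """Identify a validation result by graph, focus node, and source shape."""
    return (result.get("graph", ""), result["focus"], result["shape"])


def diff_reports(report_a, report_b):
    """Return the results that appear in only one of the two reports.

    Reports are compared as multisets, so duplicated results and result
    ordering are both handled.
    """
    counts_a = Counter(result_key(r) for r in report_a)
    counts_b = Counter(result_key(r) for r in report_b)
    return {
        "only_in_a": sorted((counts_a - counts_b).elements()),
        "only_in_b": sorted((counts_b - counts_a).elements()),
    }


def errors_per_graph(report):
    """Count validation results per named graph annotation."""
    return Counter(r.get("graph", "") for r in report)
```

Two engines agree on a dataset when both lists returned by `diff_reports` are empty; a mismatch in `errors_per_graph` narrows the disagreement down to specific contributors' graphs.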

Figures

Figures reproduced from arXiv: 2605.10540 by Christophe Debruyne, Davan Chiem Dao, Ghislain Atemezing.

Original abstract

SHACL-DS extends SHACL for RDF dataset validation by introducing declarative targeting of named graphs and graph combinations, but has not yet been demonstrated and assessed on a real, large-scale Knowledge Graph (KG). In this paper, we apply the SHACL-DS approach to validate its use on such a KG. We apply SHACL-DS to the European Railway Infrastructure (ERA RINF) KG, a large-scale RDF dataset in which 56 infrastructure managers contribute data to dedicated named graphs. We migrate the ERA-RINF shapes to SHACL-DS using two strategies and evaluate their performance using a TopBraid SHACL-DS implementation developed for this study. We compare the performance against the SHACL approach, which "flattens" all graphs into a single data graph. Both strategies produce the same results and are faster than the SHACL baseline. Not only do we demonstrate that SHACL-DS is at least as expressive as SHACL, but SHACL-DS also allows the validation scope to be declared inside the shapes artefact, enforces triple provenance through GRAPH clauses, enriches validation reports with per-graph annotations, and enables shape organisation across named shapes graphs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a solid first real-world test of SHACL-DS on a large multi-contributor KG, but the performance and equivalence claims rest on an unverified custom implementation.

The main thing to know is that this work applies SHACL-DS to the ERA RINF knowledge graph with its 56 named graphs from different infrastructure managers, migrates existing shapes in two ways, and reports that both approaches match the original results while running faster than a flattened SHACL baseline. That is the concrete new evidence the abstract presents, and it is the first reported assessment at this scale.

The paper also notes practical upsides such as keeping targeting declarations inside the shapes file, using GRAPH clauses for provenance, adding per-graph annotations to reports, and organizing shapes across named graphs. Those points follow directly from the extension and are shown in the case study. The migration strategies themselves look like useful details for anyone who has to move from plain SHACL to dataset-level validation. The evidence is empirical and tied to a fixed, real dataset rather than synthetic examples, which gives it some grounding.

The central soft spot is the reliance on a custom TopBraid SHACL-DS implementation built for the study. Without a reference implementation, a cross-check against another engine, or a formal argument that the GRAPH handling and report enrichment preserve semantics exactly, the speedups and result identity could be specific to that code rather than properties of SHACL-DS itself. The abstract does not spell out measurement methodology or checks for bias in the shape migration, so those details will matter in the full text.

This paper is for readers working on RDF dataset validation, provenance in semantic data, or large-scale knowledge graphs with named graphs. It is not a theoretical advance but a practical demonstration that can inform adoption decisions. The thinking is direct and engages the existing SHACL literature without obvious internal contradictions. I would send it to peer review so the implementation questions and measurement details can be clarified.

Referee Report

2 major / 2 minor

Summary. The paper applies SHACL-DS to the ERA RINF Knowledge Graph (a large-scale RDF dataset with 56 named graphs from infrastructure managers), migrates existing SHACL shapes using two strategies, and evaluates them against a flattened SHACL baseline using a custom TopBraid SHACL-DS implementation developed for the study. It claims that both migration strategies produce identical validation results, outperform the SHACL baseline in performance, and that SHACL-DS is at least as expressive as SHACL while adding declarative named-graph targeting, GRAPH-clause provenance enforcement, per-graph report annotations, and shape organization across named shapes graphs.

Significance. If the central claims hold after verification, the work supplies a concrete, large-scale empirical demonstration of SHACL-DS on a real multi-contributor KG, showing practical gains in scoping, provenance, and reporting that standard SHACL lacks; this could support broader adoption of dataset-level validation extensions.

major comments (2)

[Abstract] Abstract and evaluation description: the claims of result equivalence and faster performance are load-bearing and rest entirely on a custom TopBraid SHACL-DS implementation developed for the study; no reference implementation, formal proof, cross-validation against another engine, or explicit checks for fidelity in GRAPH-clause handling and shape-graph organization are provided, so observed speedups and identity could be artifacts of that specific code rather than intrinsic properties of SHACL-DS.
[Evaluation] Evaluation section (implied by abstract): the manuscript states equivalent results and faster performance but supplies no details on measurement methodology, statistical analysis, potential biases introduced by the two shape-migration strategies, or error-handling behavior, which directly affects the soundness of the performance and equivalence conclusions.

minor comments (2)

Clarify the exact definition and scope of the two migration strategies in a dedicated subsection so readers can assess semantic preservation independently of the implementation.
Add explicit references to the SHACL-DS draft specification and any prior SHACL-DS literature when describing the added features (declarative targeting, GRAPH clauses, report enrichment).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the transparency of our evaluation. We address the two major comments point by point below and will revise the manuscript to incorporate additional details on the implementation and evaluation methodology.

Referee: [Abstract] Abstract and evaluation description: the claims of result equivalence and faster performance are load-bearing and rest entirely on a custom TopBraid SHACL-DS implementation developed for the study; no reference implementation, formal proof, cross-validation against another engine, or explicit checks for fidelity in GRAPH-clause handling and shape-graph organization are provided, so observed speedups and identity could be artifacts of that specific code rather than intrinsic properties of SHACL-DS.

Authors: We acknowledge that the evaluation depends on our custom SHACL-DS implementation, as no public reference implementation existed. In the revised version we will add a new subsection describing the implementation in detail, including our handling of GRAPH clauses, named shapes graphs, and the internal consistency checks we performed on small test cases. While a formal proof lies outside the paper's scope, the identical results obtained from two independent migration strategies (which differ substantially in shape structure) provide evidence that the equivalence is not an artifact of the code. We will also clarify that the performance advantage arises directly from the declarative targeting mechanism defined in SHACL-DS, which avoids the full-graph flattening required by standard SHACL.

Revision: yes

Referee: [Evaluation] Evaluation section (implied by abstract): the manuscript states equivalent results and faster performance but supplies no details on measurement methodology, statistical analysis, potential biases introduced by the two shape-migration strategies, or error-handling behavior, which directly affects the soundness of the performance and equivalence conclusions.

Authors: We agree that the current evaluation section lacks sufficient methodological detail. In the revision we will expand it to include: hardware and software environment specifications, timing methodology (including warm-up runs and repetition count), statistical reporting (means, standard deviations, and confidence intervals), an explicit discussion of potential biases from the two migration strategies (including how the strategies were chosen to be complementary), and a description of error-handling and report-comparison procedures used to establish result equivalence. These additions will allow readers to assess the reliability of the reported performance and equivalence claims.

Revision: yes
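The timing methodology promised in this response (warm-up runs, a fixed repetition count, and mean, standard deviation, and confidence interval) might be sketched as below. This is a generic harness, not the authors' benchmark code: the function names, the default counts, and the normal-approximation interval are assumptions made for illustration.

```python
import math
import statistics
import time


def summarise(samples):
    """Summarise timing samples with mean, sample stdev, and a 95% interval.

    The interval uses the normal approximation (z = 1.96). With few
    repetitions a Student's t quantile would give a wider, safer interval.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if n > 1 else 0.0
    half_width = 1.96 * stdev / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
    }


def benchmark(run, warmups=3, repetitions=10, clock=time.perf_counter):
    """Time run() after discarding warm-up calls; return summary statistics."""
    for _ in range(warmups):
        run()  # warm caches and JIT state; these timings are not recorded
    samples = []
    for _ in range(repetitions):
        start = clock()
        run()
        samples.append(clock() - start)
    return summarise(samples)
```

Running the same harness over the flattened baseline and over each migration strategy would make it possible to say whether the reported speedup exceeds run-to-run noise, for instance by checking that the confidence intervals do not overlap.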

Circularity Check

0 steps flagged

Empirical case study with no derivation chain or self-referential elements

The paper reports an empirical application of SHACL-DS to the ERA RINF KG. It migrates existing shapes using two strategies, runs validation via a custom TopBraid implementation, and directly measures that both strategies produce identical results and outperform the flattened SHACL baseline. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear. Claims rest on observable outputs from an external fixed KG rather than any reduction to the paper's own inputs by construction. The custom implementation is a methodological detail whose fidelity is an external verification question, not a circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen implementation correctly realizes SHACL-DS and that the migration strategies are semantically equivalent to the original SHACL shapes.

axioms (1)

Domain assumption: The TopBraid SHACL-DS implementation correctly and completely realizes the SHACL-DS specification.
All performance and equivalence claims depend on this unverified implementation fidelity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
"We migrate the ERA-RINF shapes to SHACL-DS using two strategies... Both strategies produce the same results and are faster than the SHACL baseline."
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
"SHACL-DS extends SHACL... by introducing declarative targeting of named graphs and graph combinations"
