Automated Semantic Fault Localization in SysML v2: A Human-in-the-Loop Framework Using Knowledge-Graph Augmented LLMs

Haitham Al-Shami; Jari Vepsäläinen; Raine Viitala; Riku Ala-Laurinaho; Rohail Malik

arxiv: 2606.23395 · v1 · pith:S3LXAYEJnew · submitted 2026-06-22 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-26 07:32 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords SysML v2 · semantic fault localization · knowledge graph · small language models · model-based systems engineering · human-in-the-loop · unified diff patches · fine-tuning

The pith

A knowledge graph and a fine-tuned small language model localize semantic faults in SysML v2 models and suggest repairs as unified diff patches, with a reported repair success rate above 91 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that pairs a domain knowledge graph of physical compatibility rules with fine-tuned small language models to detect and repair semantic errors in SysML v2 models. These errors pass syntactic checks yet violate engineering constraints such as interface compatibility in vehicle systems. The graph supplies synthetic training examples by inserting plausible violations and later constrains the model's repair suggestions at inference time. Two models are fine-tuned to emit unified diff patches that localize the fault and offer candidate fixes. On 1,184 test samples this raises successful repair from under 3 percent to over 91 percent while shrinking output token length by more than 60 percent.

Core claim

The framework combines a knowledge graph encoding physical compatibility rules with fine-tuned small language models to automatically localize semantic faults in SysML v2 models and suggest repairs as unified diff patches. The graph generates synthetic training data by introducing plausible violations and augments inference to ensure suggestions respect domain constraints. Evaluation shows fine-tuning boosts repair success from less than 3% to more than 91% on 1,184 samples in the vehicle systems domain.
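
The two roles of the graph described above (checking compatibility and seeding synthetic violations) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the rule table, the `is_compatible` and `inject_violation` helpers, and the port names are hypothetical and do not come from the paper, whose knowledge graph is presumably far richer.

```python
import random

# Hypothetical compatibility rules: which interface domains may be connected.
# The four domains mirror those named in the abstract; the rule set itself is invented.
COMPATIBLE = {
    ("electrical", "electrical"),
    ("mechanical", "mechanical"),
    ("fluid", "fluid"),
    ("signal", "signal"),
}


def is_compatible(src_domain, dst_domain):
    """Return True if the rule table allows connecting the two domains."""
    return (src_domain, dst_domain) in COMPATIBLE


def inject_violation(connection, rng):
    """Return a copy of a valid connection whose target domain is made incompatible."""
    domains = sorted({d for pair in COMPATIBLE for d in pair})
    wrong = [d for d in domains if not is_compatible(connection["src_domain"], d)]
    faulty = dict(connection)  # leave the valid sample untouched
    faulty["dst_domain"] = rng.choice(wrong)
    return faulty


valid = {
    "src": "battery.powerOut", "src_domain": "electrical",
    "dst": "motor.powerIn", "dst_domain": "electrical",
}
faulty = inject_violation(valid, random.Random(0))

print(is_compatible(valid["src_domain"], valid["dst_domain"]))
print(is_compatible(faulty["src_domain"], faulty["dst_domain"]))
```

The same `is_compatible` lookup could in principle be reused at inference time to reject a suggested repair that violates a rule, which is the grounding role the summary attributes to the graph.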

What carries the argument

Knowledge-graph-augmented fine-tuned small language model that outputs unified diff patches for semantic fault localization and repair.

If this is right

Semantic violations that survive compiler checks can be caught and presented as candidate patches before they reach integration testing.
Patch-based output reduces the length of model suggestions by more than 60 percent compared with full rewritten models.
The human engineer retains final judgment because the system produces reviewable diffs rather than autonomous edits.
The same knowledge-graph approach can in principle be rebuilt for other SysML v2 domains once their interface rules are encoded.
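
To make the patch-based output concrete, here is a small sketch that builds a unified diff with Python's standard `difflib` module. The SysML v2 snippet, the part and port names, and the file names are illustrative assumptions, not samples from the paper's dataset.

```python
import difflib

# Illustrative model: syntactically plausible, but an electrical output
# is wired to a coolant (fluid) input.
faulty = """part def Vehicle {
    part battery : Battery;
    part pump : CoolantPump;
    connect battery.powerOut to pump.coolantIn;
}
"""

# Hypothetical repair: reconnect to the pump's electrical input.
repaired = faulty.replace("pump.coolantIn", "pump.powerIn")

patch = "".join(
    difflib.unified_diff(
        faulty.splitlines(keepends=True),
        repaired.splitlines(keepends=True),
        fromfile="a/vehicle.sysml",
        tofile="b/vehicle.sysml",
        n=1,  # one line of context around the change
    )
)
print(patch)
```

For a model this small the patch is no shorter than the model itself; any length saving of the kind the paper reports would only show up on models much larger than the edited region.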

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the graph can be maintained as designs evolve, the method could serve as a continuously updated guardrail inside existing MBSE toolchains.
Early localization of interface mismatches might shorten the feedback loop between modeling and physical prototyping in complex systems.
The synthetic-data generation step could be reused to stress-test other verification tools that currently rely only on syntactic rules.

Load-bearing premise

The knowledge graph fully and accurately encodes the physical compatibility rules and the synthetic violations it generates match the semantic errors engineers actually make.

What would settle it

Apply the trained model to a collection of real SysML v2 vehicle models that contain documented semantic faults introduced by practicing engineers and measure whether repair success stays above 50 percent.

Figures

Figures reproduced from arXiv: 2606.23395 by Haitham Al-Shami, Jari Vepsäläinen, Raine Viitala, Riku Ala-Laurinaho, Rohail Malik.

**Figure 1.** Overview of the proposed framework. A base corpus of 256 valid SysML v2 samples is augmented through heuristic [...]

**Figure 2.** Breakdown of the structural and behavioral elements of the SysML v2 code in the dataset.

**Figure 3.** Representation of different constructs in syntax and domain/semantic error examples.

**Figure 4.** Losses over training (left) and evaluation (right) datasets during training. Training is set to last 3 epochs, but is early [...]

Original abstract

SysML v2's textual syntax enables compiler-based validation of model structure and language conformance. However, semantic mistakes that preserve syntactic validity but violate domain rules cannot be detected through compilers. These errors can propagate through the design process and surface late as costly integration failures. This paper presents a human-in-the-loop framework for identifying and repairing such errors automatically. It combines a fine-tuned Small Language Model (SLM) with a domain knowledge graph encoding physical compatibility rules between system elements. The knowledge graph also guides the generation of synthetic training data by systematically introducing plausible domain violations, and augments the model at inference time to ground repair suggestions in valid engineering constraints. We demonstrate the framework using the vehicle systems domain, where the knowledge graph captures the relationships between the mechanical, electrical, fluid, and signal interfaces. Two SLMs, Qwen2.5-Coder-1.5B and DeepSeek-Coder-6.7B, are fine-tuned to output unified diff patches that localize faults and present candidate repairs for engineer review, preserving human judgment in the design process. Evaluation of 1,184 test samples shows that fine-tuning improves semantic fault repair from less than 3% to more than 91%, with patch-based output reducing token length by over 60%. The framework offers a practical path toward AI-assisted model verification that complements existing MBSE tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The headline result depends on test cases generated from the same knowledge graph used to create the training data, so the jump from under 3% to over 91% may not show the model handling real semantic faults.

The paper describes a pipeline that builds a domain knowledge graph for vehicle systems, uses it to inject synthetic violations into SysML v2 models for training data, fine-tunes small coders on producing unified diff patches, and then re-uses the graph at inference to keep repairs inside valid constraints. They report the fine-tuned models reaching over 91% repair success on 1,184 held-out samples while cutting output length by more than 60%.

The concrete integration of the graph for both data generation and inference-time grounding is the clearest new piece; prior work on LLM-based fault localization does not appear to combine those two roles in this way for SysML v2. The patch-based output and the explicit human review step are also practical choices that keep the engineer in the loop.

The main limitation is that every test sample is produced by the same systematic violation process that generated the training data. Nothing in the abstract indicates a separate collection of real engineer-introduced faults or an expert check that the synthetic distribution matches actual mistakes. If the model is mainly internalizing the explicit rules encoded in the graph, the reported delta will not necessarily carry over to faults outside that distribution.

This work is aimed at MBSE teams already using SysML v2 on vehicle or similar physical systems who want an automated first pass on semantic checks. It is worth sending for peer review because the practical framing and the reported numbers are clear enough to evaluate, but the authors will need to add either real-fault validation or a stronger argument that the synthetic set is representative.

Referee Report

3 major / 1 minor

Summary. The paper presents a human-in-the-loop framework for semantic fault localization and repair in SysML v2 models. It uses a domain knowledge graph encoding physical compatibility rules for vehicle systems to generate synthetic training data with injected violations and to augment a fine-tuned SLM (Qwen2.5-Coder-1.5B or DeepSeek-Coder-6.7B) at inference. The model outputs unified diff patches for engineer review. On 1,184 test samples, fine-tuning raises repair success from <3% to >91% while cutting token length by >60%.

Significance. If the evaluation generalizes, the work would offer a practical complement to syntactic compiler checks in MBSE by addressing domain-rule violations early in design. Credit is due for the concrete before/after metrics on a sizable test set, the patch-based output format that preserves human oversight, and the dual use of the KG for data generation and grounding. However, the synthetic-only evaluation limits immediate claims about real-world utility.

major comments (3)

[Abstract] The headline result (repair rate rising from <3% to >91% on 1,184 samples) rests entirely on test cases created by the same KG-driven violation injection process used to generate training data. No description is given of how the test split was constructed to avoid leakage, nor of any held-out set of real engineer-introduced faults or expert validation that the synthetic distribution matches actual SysML v2 modeling errors.
[Abstract] The <3% baseline is not defined (zero-shot LLM? rule-based checker? other MBSE tool?). Without an explicit comparison to existing semantic-analysis or fault-localization methods for SysML or MBSE, the magnitude of the reported improvement cannot be assessed.
[Abstract] The framework's claim that the KG 'fully and accurately encodes the physical compatibility rules' and that repairs are 'grounded in valid engineering constraints' is load-bearing, yet no completeness, consistency, or expert-validation study of the KG is reported.

minor comments (1)

[Abstract] The abstract would benefit from a one-sentence definition of the 'semantic fault repair' success metric (exact match to ground-truth patch? semantic equivalence? engineer acceptance rate?).
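
The choice of metric matters because two patches can encode the same edit yet differ textually. The sketch below contrasts two candidate definitions, strict textual match and match on changed lines only. Both functions and the sample patches are hypothetical illustrations; the source does not say which definition, if either, the paper uses.

```python
def normalize(patch):
    """Drop blank lines and trailing whitespace from a patch."""
    return [line.rstrip() for line in patch.splitlines() if line.strip()]


def exact_match(predicted, reference):
    """Strict metric: the whole patch text must agree after normalization."""
    return normalize(predicted) == normalize(reference)


def changed_lines(patch):
    """Keep only added/removed lines, skipping file headers, hunk markers and context.

    Simplification: a removed line whose content itself starts with '--'
    would be mistaken for a file header.
    """
    return [
        line for line in normalize(patch)
        if line[0] in "+-" and not line.startswith(("+++", "---"))
    ]


def edit_match(predicted, reference):
    """Looser metric: the same lines must be added and removed."""
    return changed_lines(predicted) == changed_lines(reference)


reference = """--- a/vehicle.sysml
+++ b/vehicle.sysml
@@ -3,3 +3,3 @@
     part pump : CoolantPump;
-    connect battery.powerOut to pump.coolantIn;
+    connect battery.powerOut to pump.powerIn;
 }
"""

# Same edit, emitted without context lines.
predicted = """--- a/vehicle.sysml
+++ b/vehicle.sysml
@@ -4 +4 @@
-    connect battery.powerOut to pump.coolantIn;
+    connect battery.powerOut to pump.powerIn;
"""

print(exact_match(predicted, reference), edit_match(predicted, reference))
```

Here the strict metric rejects the prediction while the looser one accepts it, so a reported success rate can shift depending on which definition is applied.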

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and indicate where revisions will be made to improve clarity and completeness.

Referee: [Abstract] The headline result (repair rate rising from <3% to >91% on 1,184 samples) rests entirely on test cases created by the same KG-driven violation injection process used to generate training data. No description is given of how the test split was constructed to avoid leakage, nor of any held-out set of real engineer-introduced faults or expert validation that the synthetic distribution matches actual SysML v2 modeling errors.

Authors: The evaluation uses synthetic data generated via the KG violation injection process for both training and testing. The 1,184 test samples were produced with distinct random seeds and no shared violation instances from the training set; the split was performed at the sample level post-generation to reduce leakage risk. We agree that the lack of real engineer-introduced faults and expert validation of the synthetic distribution is a limitation of the current study. We will revise the manuscript to explicitly detail the data generation and splitting procedure in the Evaluation section and add a limitations discussion on the synthetic-only nature of the dataset.
revision: yes

Referee: [Abstract] The <3% baseline is not defined (zero-shot LLM? rule-based checker? other MBSE tool?). Without an explicit comparison to existing semantic-analysis or fault-localization methods for SysML or MBSE, the magnitude of the reported improvement cannot be assessed.

Authors: The <3% figure represents the zero-shot performance of the base SLMs (Qwen2.5-Coder-1.5B and DeepSeek-Coder-6.7B) without fine-tuning or KG augmentation. We will revise the abstract and Evaluation section to define this baseline explicitly. Regarding comparisons to other MBSE semantic analysis tools, the manuscript focuses on the novel KG-augmented fine-tuning approach for SysML v2; we will expand the related work section to discuss why direct empirical comparisons were not feasible at this stage due to the absence of comparable public implementations for semantic fault localization in this domain.
revision: yes

Referee: [Abstract] The framework's claim that the KG 'fully and accurately encodes the physical compatibility rules' and that repairs are 'grounded in valid engineering constraints' is load-bearing, yet no completeness, consistency, or expert-validation study of the KG is reported.

Authors: The KG was constructed based on domain knowledge of vehicle systems interfaces (mechanical, electrical, fluid, signal) drawn from engineering standards and expert input. We acknowledge that the current manuscript does not include a formal completeness, consistency, or external expert validation study of the KG. We will revise the manuscript to add a dedicated subsection on KG construction, including the rule sources and any internal checks performed, and note the need for broader validation as future work.
revision: partial
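
The sample-level split described in the first response could be audited for literal overlap with a check along these lines. This is a sketch under stated assumptions: samples are treated as plain strings, and the `fingerprint` and `overlap` helpers and the example connections are hypothetical, not part of the paper's pipeline.

```python
import hashlib


def fingerprint(sample):
    """Hash a sample after collapsing whitespace, so formatting cannot hide duplicates."""
    canonical = " ".join(sample.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlap(train, test):
    """Return the test samples whose fingerprint also occurs in the training set."""
    train_prints = {fingerprint(s) for s in train}
    return [s for s in test if fingerprint(s) in train_prints]


train = [
    "connect battery.powerOut to pump.coolantIn;",
    "connect tank.fuelOut to ecu.canIn;",
]
test = [
    "connect  battery.powerOut  to pump.coolantIn;",  # same fault, different spacing
    "connect alternator.powerOut to radiator.coolantIn;",
]

leaked = overlap(train, test)
print(len(leaked), "of", len(test), "test samples also appear in training")
```

A check like this only catches literal duplicates. It says nothing about whether the test faults follow the same rule-driven distribution as the training faults, which is the deeper concern the referee raises.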

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on held-out synthetic samples is self-contained.

The paper's central claim is an empirical result: fine-tuning lifts semantic fault repair from <3% to >91% on 1,184 test samples. No derivation chain, equations, or self-referential definitions are present. Training and test data are both generated from the same knowledge graph, but this is standard supervised learning on held-out synthetic data rather than a reduction by construction (no fitted parameter is renamed as a prediction, no self-citation carries the result, and no ansatz is smuggled in). The framework is evaluated against its own generated distribution, and the result remains externally falsifiable by testing on real engineer-introduced faults, which satisfies the independence criteria. No load-bearing step reduces to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the assumption that the knowledge graph is a complete and accurate encoding of domain rules and that synthetic errors match real engineer mistakes; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: The domain knowledge graph correctly and comprehensively represents physical compatibility rules between system elements.
Invoked to generate synthetic training data and to ground repair suggestions at inference time.

pith-pipeline@v0.9.1-grok · 5805 in / 1262 out tokens · 37225 ms · 2026-06-26T07:32:38.677998+00:00 · methodology
