arxiv: 2604.16280 · v1 · submitted 2026-04-17 · 💻 cs.AI

Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing

Thomas Bayer , Alexander Lohr , Sarah Wei{\ss} , Bernd Michelberger , Wolfram H\"opken This is my paper

Pith reviewed 2026-05-10 08:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords explainable AIknowledge graphslarge language modelsmanufacturingmachine learning interpretabilitydecision support

0 comments

The pith

A knowledge graph linked to machine learning outputs lets large language models generate accurate explanations for manufacturing decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a system that stores manufacturing domain knowledge, machine learning predictions, and basic explanations together in a knowledge graph. Relevant pieces of this information are selectively retrieved and passed to a large language model, which turns them into natural, user-friendly explanations. This method was tested in a real manufacturing setting against 33 questions, some standard and some more complex, using measures of accuracy, consistency, clarity, and usefulness. A sympathetic reader would care because it offers a way to make black-box machine learning models transparent enough for practical use on the factory floor, where decisions affect production and costs.

Core claim

The authors establish that by structuring domain data and ML outputs in a knowledge graph and employing selective retrieval of relevant facts from the graph to inform an LLM, the resulting explanations of ML results achieve high factual accuracy and practical usefulness in manufacturing contexts.

What carries the argument

The selective retrieval of relevant facts from the knowledge graph that conditions the large language model's generation of explanations.

Load-bearing premise

That the combination of selective facts from the knowledge graph and language model output will consistently yield explanations that are correct and valuable for real manufacturing decisions.

What would settle it

A test case in the manufacturing environment where the generated explanation contradicts known domain facts or leads operators to a suboptimal production decision.

Figures

Figures reproduced from arXiv: 2604.16280 by Alexander Lohr, Bernd Michelberger, Sarah Wei{\ss}, Thomas Bayer, Wolfram H\"opken.

**Figure 2.** Figure 2: Excerpt from Knowledge Graph by querying the KG. Since the retrieved information of each turn is collected and the knowledge graph is finite, termination of the algorithm is ensured. For prompting an LLM, we use openai-chat format, cf. [16], and the function LlmResponse ( [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Schematic example of KG traversal of our Graph-RAG based prototype, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Function call gpt-4o-2024-11-20, with temperature 1 and a fixed random seed to ensure reproducibility. #System Message The following structure illustrates the class level of the ontology, which will be used to answer the subsequent questions. The node classes have instances that are not listed here: {ontology_structure}. #User Message Only give as an answer a list of classes (following this syntax: [class1… view at source ↗

**Figure 5.** Figure 5: Prompt Template for Initial Step 4 Evaluation and Results Our prototype is applied to a manufacturing setting where a robotic manipulator places screws into holes at varying angles. The placement success is predicted based on screw geometry and robot-arm attributes. The KG described in Section 3.1 provides information on tasks, models, hardware, and their relations. 4.1 Evaluation Methodology We evaluated… view at source ↗

**Figure 6.** Figure 6: Subjective evaluation of answers for user groups worker and developer [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Kendall’s τ correlation matrices for ratings of user roles. Shows pairwise correlations; black diagonal cells indicate undefined self-correlations (NaN) aligned, with Question 6 displaying the least stability. Length appropriateness is more heterogeneous, as the worker role shows greater individual variation. Structure retains strong positive correlations, indicating stable agreement. Overall, the correlat… view at source ↗

**Figure 8.** Figure 8: Example: Ambiguous or Underspecified Request [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

read the original abstract

Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamically access a KG in order to improve the explainability of ML results. From a practical perspective, we provide empirical evidence showing that such explanations can be successfully applied in real-world manufacturing environments, supporting better decision-making in manufacturing processes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a working example of pulling triplets from a manufacturing KG to ground LLM explanations of ML outputs, but the 33-question evaluation reports only absolute scores with no baselines, so the added value of the KG step stays unproven.

read the letter

The core contribution is a pipeline that stores domain data, ML results, and explanations in a knowledge graph, then selectively retrieves relevant triplets to feed an LLM for user-friendly outputs. They tested this on 33 questions from the XAI Question Bank plus some custom harder ones, scoring on accuracy, consistency, clarity, and usefulness in a real factory setting. That setup is new enough in the manufacturing context and shows a practical way to connect structured knowledge with generative explanations. The authors also make clear that the method is meant to support decision-making rather than just produce generic text. The evaluation gives concrete numbers and qualitative feedback, which is better than pure theory. The main weakness is the missing comparisons. There are no runs against a plain LLM prompt, against SHAP or LIME outputs, or against simple rule-based KG queries, so it is impossible to tell whether the selective retrieval step actually improves anything or whether the LLM alone would score similarly. The abstract and description also leave out details on how the 33 questions were selected, whether raters were blinded, or any statistical tests, which makes the usefulness claims harder to weigh. The method itself looks independent of the results and draws on an external question bank, so there is no obvious circularity. This work is aimed at applied researchers and engineers who already use KGs or LLMs in industrial XAI and want a concrete integration pattern to try. It is not going to shift the broader field, but someone building a similar system could borrow the retrieval logic. I would bring it to a reading group as a case study rather than a foundational paper. It is solid enough on the implementation side to deserve peer review, provided the authors add at least one baseline arm and more transparency on the metrics.

Referee Report

2 major / 2 minor

Summary. The paper proposes integrating a Knowledge Graph (KG) storing domain-specific manufacturing data, ML model results, and explanations with a Large Language Model (LLM) via selective triplet retrieval to generate user-friendly explanations of ML outputs. It claims this dynamic KG access improves interpretability over standard approaches, with evaluation on 33 questions (standard and tailored complex ones) from the XAI Question Bank using quantitative metrics (accuracy, consistency) and qualitative metrics (clarity, usefulness), supported by empirical evidence from a real-world manufacturing environment.

Significance. If the central claim holds under controlled evaluation, the work offers a practical bridge between structured domain knowledge and generative explanations, with potential to support better decision-making in manufacturing XAI applications. The real-world deployment and use of an external XAI Question Bank provide a concrete strength in applicability, though the lack of comparative baselines leaves the improvement attributable to KG+LLM integration under-supported.

major comments (2)

[Evaluation] Evaluation section (as described): The reported results on 33 questions provide absolute scores for accuracy, consistency, clarity, and usefulness but include no baseline arms (e.g., LLM without retrieval, standard XAI methods such as SHAP/LIME, or rule-based KG queries). This prevents isolating the contribution of selective triplet retrieval to any observed gains and weakens support for the claim that the approach improves explainability.
[Evaluation] Evaluation section: The introduction of 'more complex, tailored questions' beyond the standard XAI Question Bank raises the possibility of post-hoc selection or tailoring; without pre-specification, inter-rater agreement details, or statistical significance testing, it is unclear whether these questions fairly test the method or introduce selection bias into the usefulness and clarity assessments.

minor comments (2)

[Method] The abstract and method description would benefit from explicit details on KG construction (e.g., how ML results and explanations are encoded as triplets), the exact retrieval algorithm, and the specific LLM employed to enable reproducibility.
[Evaluation] Clarify whether the 33 questions were evaluated by multiple raters and report any inter-rater reliability metrics to strengthen the qualitative assessments of clarity and usefulness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the revisions we will incorporate to strengthen the evaluation section.

read point-by-point responses

Referee: [Evaluation] Evaluation section (as described): The reported results on 33 questions provide absolute scores for accuracy, consistency, clarity, and usefulness but include no baseline arms (e.g., LLM without retrieval, standard XAI methods such as SHAP/LIME, or rule-based KG queries). This prevents isolating the contribution of selective triplet retrieval to any observed gains and weakens support for the claim that the approach improves explainability.

Authors: We agree that the lack of baseline comparisons limits the strength of claims about the specific benefits of selective triplet retrieval. In the revised manuscript we will add new experiments that include (1) an LLM-only condition without KG retrieval and (2) a rule-based KG query baseline that returns raw triplets without LLM generation. We will report the same quantitative and qualitative metrics for these conditions. Direct comparison with SHAP or LIME is not straightforward because those methods produce feature-importance scores rather than natural-language explanations grounded in manufacturing domain knowledge; we will add a short discussion clarifying this distinction and why a head-to-head numerical comparison would be misleading. revision: yes
Referee: [Evaluation] Evaluation section: The introduction of 'more complex, tailored questions' beyond the standard XAI Question Bank raises the possibility of post-hoc selection or tailoring; without pre-specification, inter-rater agreement details, or statistical significance testing, it is unclear whether these questions fairly test the method or introduce selection bias into the usefulness and clarity assessments.

Authors: We acknowledge that the current description does not provide sufficient detail on how the tailored questions were generated or evaluated. In the revision we will (1) list all 33 questions explicitly, (2) describe the criteria used to create the additional manufacturing-specific questions (including that they were formulated before running the evaluation), (3) report inter-rater agreement statistics for the qualitative ratings of clarity and usefulness, and (4) include statistical significance tests comparing the standard and tailored question sets where appropriate. These additions will allow readers to judge whether selection bias is present. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and evaluation are independently defined against external benchmarks

full rationale

The paper presents a method for selective triplet retrieval from a KG combined with LLM generation to produce explanations of ML results, then evaluates the outputs on 33 questions drawn from the external XAI Question Bank using accuracy, consistency, clarity, and usefulness metrics. No equations, fitted parameters, or first-principles derivations appear in the described chain. The method is specified independently of its own evaluation outcomes, and the benchmark is external rather than self-generated. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim therefore does not reduce to its inputs by construction and remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that the XAI Question Bank is an appropriate and unbiased test set for manufacturing explanations and that LLM outputs remain faithful when conditioned on retrieved triplets; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The XAI Question Bank supplies a representative and sufficient set of questions for assessing explanation quality in a manufacturing context.
Invoked when the authors state they evaluated 33 questions from the bank without further justification of domain fit.

pith-pipeline@v0.9.0 · 5540 in / 1271 out tokens · 47079 ms · 2026-05-10T08:43:47.360415+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 1 canonical work pages

[1]

IEEE access : practical innovations, open solutions6, 52138–52160 (2018)

Adadi, A., Berrada, M.: Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE access : practical innovations, open solutions6, 52138–52160 (2018)

2018
[2]

In: Findings of the Association for Computational Linguistics: NAACL 2024

Agarwal, S., Menon, R., Singh, S., Gardner, M., Khashabi, D.: Bring your own KG: Self-supervised program synthesis for zero-shot KGQA. In: Findings of the Association for Computational Linguistics: NAACL 2024. pp. 837–859 (2024)

2024
[3]

Information Fusion58, 82–115 (2020)

Arrieta, B.A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., Herrera, F.: Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion58, 82–115 (2020)

2020
[4]

Intelligent Systems with Applications26, 200501 (2025)

Benhanifia, A., Cheikh, Z.B., Oliveira, P.M., Valente, A., Lima, J.: Systematic review of predictive maintenance practices in the manufacturing sector. Intelligent Systems with Applications26, 200501 (2025)

2025
[5]

In: The Semantic Web – ESWC 2024

Dasoulas, I., et al.: MLSea: A semantic layer for discoverable machine learning. In: The Semantic Web – ESWC 2024. Lecture Notes in Computer Science, Springer (2024)

2024
[6]

Doshi-Velez,F.,Kim,B.:Towardsarigorousscienceofinterpretablemachinelearn- ing (2017)

2017
[7]

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From local to global: A graph RAG approach to query-focused summarization (2025)

2025
[8]

International Journal of Machine Learning12(4), 200–220 (2023)

Garcia,L.,Sanchez,M.:Theroleofontologiesindataminingandmachinelearning. International Journal of Machine Learning12(4), 200–220 (2023)

2023
[9]

Industry 4.0 Science41(2), 30–36 (2023)

Höpken,W., Stetter,R., Pfeil,M., Bayer,T.,Michelberger, B.,Schuchter, T.,Lohr, A.: Digital twins using semantic modeling and ai. Industry 4.0 Science41(2), 30–36 (2023)

2023
[10]

In: Proceedings of the 2nd International Workshop on Semantic Technologies and Deep Learning Models for Scientific

Kaplan, A., Keim, J., Schneider, M., Koziolek, A., Reussner, R.: Combining knowl- edge graphs and large language models to ease knowledge access in software archi- tecture research. In: Proceedings of the 2nd International Workshop on Semantic Technologies and Deep Learning Models for Scientific. CEUR Workshop Proceed- ings, vol. 3697, pp. 76–82 (2024) Im...

2024
[11]

Journal of Data Science and Technology8(2), 50–70 (2023)

Khan, F., Lee, C.: Evaluating ML-schema: An empirical study on data mining interoperability. Journal of Data Science and Technology8(2), 50–70 (2023)

2023
[12]

Journal of Advances in Information Technology15(10), 1157–1162 (2024)

Lan, M., Xia, Y., Zhou, G., Huang, N., Li, Z., Wu, H.: LLM4QA: Leveraging large language model for efficient knowledge graph reasoning with SPARQL query. Journal of Advances in Information Technology15(10), 1157–1162 (2024)

2024
[13]

Journal of Big Data8(3), 3 (2021)

Liang, S., Stockinger, K., Mendes de Farias, T., Anisimova, M., Gil, M.: Querying knowledge graphs in natural language. Journal of Big Data8(3), 3 (2021)

2021
[14]

In: Proceedings of the 2020 CHI conference on human factors in computing systems

Liao, Q.V., Gruen, D., Miller, S.: Questioning the ai: informing design practices for explainable ai user experiences. In: Proceedings of the 2020 CHI conference on human factors in computing systems. pp. 1–15 (2020)

2020
[15]

In: Proceedings of the International Conference on Data Mining

Nguyen, P., Barrett, D.: Advancing data mining results sharing with ML-schema. In: Proceedings of the International Conference on Data Mining. pp. 300–315 (2023)

2023
[16]

OpenAI: OpenAI chat completion API format (2024)

2024
[17]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Ovadia,O.,Brief,M.,Mishaeli,M.,Elisha,O.:Fine-tuningorretrieval?Comparing knowledge injection in llms. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 237–250 (Jan 2024)

2024
[18]

IEEE Transactions on Knowledge and Data Engineering36(7), 3580–3599 (2024)

Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., Wu, X.: Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering36(7), 3580–3599 (2024)

2024
[19]

ML-Schema: Exposing the Semantics of Machine Learning with Schemas and Ontologies

Publio, G.C., Esteves, D., Ławrynowicz, A., Panov, P., Soldatova, L.N., Soru, T., Vanschoren, J., Zafar, H.: ML-Schema: Exposing the semantics of machine learning with schemas and ontologies. CoRRabs/1807.05351(2018)

work page Pith review arXiv 2018
[20]

IEEE access : practical innovations, open solutions10, 70712–70723 (2022)

Rony, M.R.A.H., Kumar, U., Teucher, R., Kovriguina, L., Lehmann, J.: SGPT: A generative approach for SPARQL query generation from natural language ques- tions. IEEE access : practical innovations, open solutions10, 70712–70723 (2022)

2022
[21]

The Knowledge Engineering Review34, e17 (2019)

Sampath Kumar, V.R., Khamis, A., Fiorini, S., Carbonera, J.L., Alarcos, A.O., Habib, M., Goncalves, P., Li, H., Olszewska, J.I.: Ontologies for industry 4.0. The Knowledge Engineering Review34, e17 (2019)

2019
[22]

Procedia CIRP136, 61–66 (2025)

Schuchter, T., Saft, P., Stetter, R., Pfeil, M., Höpken, W., Till, M., Rudolph, S.: Application of artificial intelligence in model-based systems engineering of auto- mated production systems. Procedia CIRP136, 61–66 (2025)

2025
[23]

IEEE Transactions on Knowledge and Data Engineering35(1), 614–633 (2023)

von Rueden, L., Mayer, S., Beckh, K., Georgiev, B., Giesselbach, S., Heese, R., Kirsch, B., Pfrommer, J., Pick, A., Ramamurthy, R., Walczak, M., Garcke, J., Bauckhage, C., Schuecker, J.: Informed machine learning – a taxonomy and sur- vey of integrating prior knowledge into learning systems. IEEE Transactions on Knowledge and Data Engineering35(1), 614–633 (2023)

2023
[24]

Journal of Artificial Intelligence Research10(1), 40–60 (2023)

Wilson, J., Liu, H.: Integration of ML-schema with machine learning platforms. Journal of Artificial Intelligence Research10(1), 40–60 (2023)

2023
[25]

In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-

Xiao, G., Calvanese, D., Kontchakov, R., Lembo, D., Poggi, A., Rosati, R., Za- kharyaschev, M.: Ontology-based data access: A survey. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-
[26]

5511–5519 (Jul 2018)

pp. 5511–5519 (Jul 2018)

2018