Can Conversational XAI Improve User Performance? An Experimental Study

Julian Rosenberger; Lasse Bohlen; Mathias Kraus; Patrick Zschech; Sven Kruschel

arxiv: 2605.20439 · v1 · pith:DX2NQKPTnew · submitted 2026-05-19 · 💻 cs.LG · cs.HC


Pith reviewed 2026-05-21 07:37 UTC · model grok-4.3


keywords: conversational XAI, explainable AI, user performance, experimental design, systematic errors, prediction accuracy, model understanding


The pith

Users can outperform an AI model by spotting and fixing its systematic errors, yet conversational explanations provide no performance advantage over Q&A formats.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether conversational XAI assistants help users achieve better prediction accuracy and understanding than traditional Q&A interfaces. Researchers created an experimental setup with an explainable-by-design model that produces consistent, detectable errors so that attentive users could compensate and exceed the model's own accuracy. In a preliminary test with 42 participants, both assistance conditions enabled users to outperform the model, but no meaningful differences appeared between conversational and Q&A groups. Engagement with the explanations remained modest overall. The results guide refinements for a larger study focused on increasing interaction and clarifying what drives the performance gains.

Core claim

Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against Q&A-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall.

What carries the argument

An experimental design that pits conversational XAI assistance against Q&A assistance in a prediction task built around an explainable-by-design model whose systematic errors users can detect and correct.

If this is right

Users supplied with explanations can exceed the accuracy of the underlying model by compensating for its biases.
Conversational and Q&A formats show no detectable difference in user performance in this setup.
Low engagement levels indicate that future designs must increase user interaction to sustain benefits.
The approach isolates the contribution of explanation format from explanation content.
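
The first implication, that a user who notices a systematic error can beat the model, can be illustrated with a toy simulation. This is a minimal sketch on synthetic data, not the paper's model or task: the bias size, the share of affected cases, the noise level, and the function names `simulate` and `mean_abs_error` are all assumptions chosen for illustration. Error is measured as mean absolute error, so lower is better.

```python
import random


def mean_abs_error(preds, truths):
    """Average absolute distance between predictions and true values."""
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(truths)


def simulate(n_cases=200, bias=5.0, noise=1.0, affected_share=0.3, seed=0):
    """Return (model_mae, user_mae) for a hypothetical biased model.

    Assumption: on a fixed share of cases the model overshoots by `bias`,
    and an attentive user recognizes those cases and subtracts the bias.
    """
    rng = random.Random(seed)
    truths, model_preds, user_preds = [], [], []
    for _ in range(n_cases):
        truth = rng.uniform(0, 100)
        affected = rng.random() < affected_share
        model_pred = truth + rng.gauss(0, noise) + (bias if affected else 0.0)
        # The user keeps the model's prediction unless the case shows the
        # known error pattern, in which case the bias is removed.
        user_pred = model_pred - bias if affected else model_pred
        truths.append(truth)
        model_preds.append(model_pred)
        user_preds.append(user_pred)
    return mean_abs_error(model_preds, truths), mean_abs_error(user_preds, truths)


if __name__ == "__main__":
    model_mae, user_mae = simulate()
    print(f"model MAE: {model_mae:.2f}, user MAE: {user_mae:.2f}")
```

In this idealized setting the user detects every affected case. Lowering the detection rate shrinks the gap, which is why per-error detection data matters for interpreting the reported gains.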

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The similar results suggest that explanation quality may matter more than delivery format for short tasks.
Testing the same design on models without built-in explainability could show whether transparency is required for users to spot errors.
If engagement is raised through longer sessions or incentives, conversational XAI might demonstrate advantages not visible here.

Load-bearing premise

Users can reliably detect and correct the model's systematic errors when given explanations from an explainable-by-design model, and this ability is what drives any performance gain.

What would settle it

A larger study in which users do not exceed the model's accuracy or in which one assistance format produces clearly higher accuracy would falsify the preliminary pattern of results.

Figures

Figures reproduced from arXiv: 2605.20439 by Julian Rosenberger, Lasse Bohlen, Mathias Kraus, Patrick Zschech, Sven Kruschel.

**Figure 2.** Performance measures across treatments. Lower values in prediction accuracy ...

Original abstract

Explainable AI (XAI) techniques aim to provide insights into predictive models and enhance user performance, yet they often fall short of these expectations. Conversational XAI assistants promise to overcome such limitations, but empirical evidence on their impact on objective performance measures remains limited. We propose an experimental design for evaluating explanation assistance through prediction accuracy, model understanding, and error identification. Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against Q&A-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall. These findings inform refinements for our planned full study, including enhanced engagement interventions and investigation of the mechanisms driving improved predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript outlines an experimental study designed to investigate whether conversational explainable AI (XAI) assistants can enhance user performance compared to traditional Q&A-based assistance. The study employs an explainable-by-design model with known systematic errors, allowing users to potentially outperform the model by identifying and correcting these errors. Key measures include prediction accuracy, model understanding, and error identification. Preliminary findings from a pilot with 42 participants indicate that users in both assistance conditions significantly outperformed the model, but there were no significant differences between the conversational and Q&A conditions, and engagement was modest overall. The results are used to refine the design for a larger full study.

Significance. Should the full study validate these preliminary findings and establish clear mechanisms linking explanations to error compensation, this research would address an important gap in XAI literature regarding objective performance improvements rather than just perceived understanding. The experimental setup with predefined model errors provides a strong foundation for testing the practical utility of different XAI interaction styles.

major comments (2)

Preliminary Results section: The assertion that participants outperformed the model by identifying and compensating for systematic errors lacks supporting data on specific error detections. Only aggregate performance metrics are presented, without per-error analysis or correlation between error identification and accuracy gains, which is central to validating the proposed benefit of the explanations.
Statistical reporting: The lack of detailed statistical reporting, such as exact p-values, effect sizes, confidence intervals, or power calculations for the observed outperformance over the model baseline, combined with the small N=42 pilot sample, reduces confidence in the preliminary conclusions and the ability to detect differences between assistance types.
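
The per-error analysis requested in the first major comment could take roughly the following shape. This is a hedged sketch over made-up records: the log layout (`detected`, `gain`), the error labels `A`, `B`, `C`, and the helper names are hypothetical and do not come from the paper. It computes a detection rate per error and a Pearson correlation between the number of errors a participant identified and that participant's accuracy gain over the model.

```python
from math import sqrt


def detection_rates(logs, error_ids):
    """Share of participants who identified each error."""
    n = len(logs)
    return {e: sum(e in p["detected"] for p in logs) / n for e in error_ids}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical participant records; "gain" is the improvement over the
# model's error on the same cases (positive means the user did better).
logs = [
    {"detected": {"A", "B"}, "gain": 4.0},
    {"detected": {"A"}, "gain": 1.5},
    {"detected": set(), "gain": -0.5},
    {"detected": {"A", "B", "C"}, "gain": 6.0},
]

rates = detection_rates(logs, ["A", "B", "C"])
r = pearson_r([len(p["detected"]) for p in logs], [p["gain"] for p in logs])

if __name__ == "__main__":
    print(rates, round(r, 3))
```

A strong positive correlation would be consistent with the claim that error identification drives the gains, though it would not by itself establish causation.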

minor comments (2)

The abstract mentions 'modest engagement overall' but does not specify the engagement metrics used; clarifying these would improve readability.
Additional references to recent studies on conversational XAI would strengthen the motivation and related work sections.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript describing a preliminary experimental study of conversational versus Q&A XAI assistance. We address each major comment below and describe the revisions we will incorporate to improve the reporting of our pilot results while preserving the focus on design refinement for the full study.


Referee: Preliminary Results section: The assertion that participants outperformed the model by identifying and compensating for systematic errors lacks supporting data on specific error detections. Only aggregate performance metrics are presented, without per-error analysis or correlation between error identification and accuracy gains, which is central to validating the proposed benefit of the explanations.

Authors: We agree that linking specific error detections to accuracy gains is important for validating the benefit of the explanations. The pilot data include per-participant logs of which systematic errors were identified in each trial. In the revised manuscript we will add a per-error breakdown (e.g., detection rates for each of the three known model biases) together with a correlation analysis between the number of errors correctly identified and individual accuracy improvements. This will be presented in an expanded Preliminary Results section or a new supplementary table.

Revision: yes

Referee: Statistical reporting: The lack of detailed statistical reporting, such as exact p-values, effect sizes, confidence intervals, or power calculations for the observed outperformance over the model baseline, combined with the small N=42 pilot sample, reduces confidence in the preliminary conclusions and the ability to detect differences between assistance types.

Authors: We acknowledge the need for more complete statistical reporting. The revised version will include exact p-values, Cohen’s d effect sizes, and 95% confidence intervals for the key comparisons (user vs. model accuracy and conversational vs. Q&A conditions). We will also add a post-hoc power analysis for the observed effects and a brief discussion of the pilot sample size (N=42) as a limitation that primarily affects detection of between-condition differences, consistent with the manuscript’s framing as a design-refinement study rather than a definitive test.

Revision: yes
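
For the user-versus-model comparison, the effect size and interval promised in this response might be computed along these lines. This is a minimal sketch under stated assumptions: the input is one difference per participant (model error minus user error on the same cases), the effect size is the paired variant of Cohen's d (mean difference divided by the standard deviation of the differences), and the interval uses a normal approximation because the Python standard library has no t distribution. With a sample of about 42 the t-based interval would be slightly wider. The function name `paired_summary` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_summary(diffs, level=0.95):
    """Mean difference, paired Cohen's d, and an approximate confidence interval.

    `diffs` holds one value per participant: model error minus user error,
    so positive values mean the user outperformed the model.
    """
    n = len(diffs)
    m = mean(diffs)
    s = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * s / sqrt(n)
    return {
        "mean_diff": m,
        "cohens_d": m / s,
        "ci": (m - half_width, m + half_width),
    }


if __name__ == "__main__":
    # Made-up differences for illustration only.
    print(paired_summary([1.0, 2.0, 3.0]))
```

The between-condition comparison (conversational vs. Q&A) would instead need an independent-samples effect size with a pooled standard deviation, which this sketch does not cover.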

Circularity Check

0 steps flagged

No significant circularity in empirical experimental study


The paper describes an empirical user study (N=42) comparing conversational and Q&A XAI assistance on prediction accuracy, model understanding, and error identification using an explainable-by-design model. All central claims rest on direct experimental measurements of participant performance rather than any mathematical derivation, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems are invoked; results are presented as preliminary observations to inform a future full study. This is a standard self-contained empirical design with independent data collection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions from experimental psychology and XAI evaluation literature rather than introducing new free parameters or invented entities.

axioms (1)

standard math: Standard assumptions for statistical significance testing in between-subjects user experiments.
Invoked when reporting that participants significantly outperformed the model.

pith-pipeline@v0.9.0 · 5692 in / 1207 out tokens · 33802 ms · 2026-05-21T07:37:07.040663+00:00 · methodology

