Can Conversational XAI Improve User Performance? An Experimental Study
Pith reviewed 2026-05-21 07:37 UTC · model grok-4.3
The pith
Users can outperform an AI model by spotting and fixing its systematic errors, yet conversational explanations provide no performance advantage over Q&A formats.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against Q&A-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall.
What carries the argument
An experimental design that pits conversational XAI assistance against Q&A assistance in a prediction task built around an explainable-by-design model whose systematic errors users can detect and correct.
If this is right
- Users supplied with explanations can exceed the accuracy of the underlying model by compensating for its biases.
- No performance difference between conversational and Q&A formats was detected in this setup, which is not the same as showing the two are equivalent.
- Low engagement levels indicate that future designs must increase user interaction to sustain benefits.
- The approach isolates the contribution of explanation format from explanation content.
Where Pith is reading between the lines
- The similar results suggest that explanation quality may matter more than delivery format for short tasks.
- Testing the same design on models without built-in explainability could show whether transparency is required for users to spot errors.
- If engagement is raised through longer sessions or incentives, conversational XAI might demonstrate advantages not visible here.
Load-bearing premise
Users can reliably detect and correct the model's systematic errors when given explanations from an explainable-by-design model, and this ability drives any performance gain.
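The premise can be made concrete with a toy sketch. Nothing here comes from the paper: the task, the size of the bias, the group labels, and the use of mean absolute error as the accuracy measure are all assumptions chosen for illustration. It shows only that a user who knows a model's systematic error can, in principle, end up more accurate than the model.

```python
# Toy illustration only: the task, the bias, and all numbers are hypothetical.

def model_predict(x: float, group: str) -> float:
    """Hypothetical model: the true value is 2*x, but it overshoots group 'B' by 5."""
    bias = 5.0 if group == "B" else 0.0
    return 2.0 * x + bias


def user_adjust(prediction: float, group: str) -> float:
    """A user who has learned the systematic error subtracts it for group 'B'."""
    return prediction - 5.0 if group == "B" else prediction


def mean_abs_error(preds, truths):
    """Average absolute difference between predictions and true values."""
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(truths)


cases = [(1.0, "A"), (2.0, "B"), (3.0, "A"), (4.0, "B")]
truths = [2.0 * x for x, _ in cases]

model_preds = [model_predict(x, g) for x, g in cases]
user_preds = [user_adjust(p, g) for p, (_, g) in zip(model_preds, cases)]

model_mae = mean_abs_error(model_preds, truths)
user_mae = mean_abs_error(user_preds, truths)
print(f"model MAE: {model_mae}, user MAE: {user_mae}")
```

In a real study the user's correction would be noisy and only partly applied, so the gap would be smaller than in this idealized case.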
What would settle it
A larger study in which users do not exceed the model's accuracy or in which one assistance format produces clearly higher accuracy would falsify the preliminary pattern of results.
Original abstract
Explainable AI (XAI) techniques aim to provide insights into predictive models and enhance user performance, yet they often fall short of these expectations. Conversational XAI assistants promise to overcome such limitations, but empirical evidence on their impact on objective performance measures remains limited. We propose an experimental design for evaluating explanation assistance through prediction accuracy, model understanding, and error identification. Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against Q&A-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall. These findings inform refinements for our planned full study, including enhanced engagement interventions and investigation of the mechanisms driving improved predictions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript outlines an experimental study designed to investigate whether conversational explainable AI (XAI) assistants can enhance user performance compared to traditional Q&A-based assistance. The study employs an explainable-by-design model with known systematic errors, allowing users to potentially outperform the model by identifying and correcting these errors. Key measures include prediction accuracy, model understanding, and error identification. Preliminary findings from a pilot with 42 participants indicate that users in both assistance conditions significantly outperformed the model, but there were no significant differences between the conversational and Q&A conditions, and engagement was modest overall. The results are used to refine the design for a larger full study.
Significance. Should the full study validate these preliminary findings and establish clear mechanisms linking explanations to error compensation, this research would address an important gap in the XAI literature regarding objective performance improvements rather than just perceived understanding. The experimental setup with predefined model errors provides a strong foundation for testing the practical utility of different XAI interaction styles.
major comments (2)
- Preliminary Results section: The assertion that participants outperformed the model by identifying and compensating for systematic errors lacks supporting data on specific error detections. Only aggregate performance metrics are presented, without per-error analysis or correlation between error identification and accuracy gains, which is central to validating the proposed benefit of the explanations.
- Statistical reporting: The lack of detailed statistical reporting, such as exact p-values, effect sizes, confidence intervals, or power calculations for the observed outperformance over the model baseline, combined with the small N=42 pilot sample, reduces confidence in the preliminary conclusions and the ability to detect differences between assistance types.
minor comments (2)
- The abstract mentions 'modest engagement overall' but does not specify the engagement metrics used; clarifying these would improve readability.
- Additional references to recent studies on conversational XAI would strengthen the motivation and related work sections.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript describing a preliminary experimental study of conversational versus Q&A XAI assistance. We address each major comment below and describe the revisions we will incorporate to improve the reporting of our pilot results while preserving the focus on design refinement for the full study.
- Referee: Preliminary Results section: The assertion that participants outperformed the model by identifying and compensating for systematic errors lacks supporting data on specific error detections. Only aggregate performance metrics are presented, without per-error analysis or correlation between error identification and accuracy gains, which is central to validating the proposed benefit of the explanations.
  Authors: We agree that linking specific error detections to accuracy gains is important for validating the benefit of the explanations. The pilot data include per-participant logs of which systematic errors were identified in each trial. In the revised manuscript we will add a per-error breakdown (e.g., detection rates for each of the three known model biases) together with a correlation analysis between the number of errors correctly identified and individual accuracy improvements. This will be presented in an expanded Preliminary Results section or a new supplementary table.
  Revision: yes
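A minimal sketch of the per-error breakdown and correlation analysis described in this response might look as follows. The log structure, the placeholder bias names (`bias_a`, `bias_b`, `bias_c`), and every number are hypothetical. The paper's actual data format is not given here, so this shows only the shape of the computation.

```python
from math import sqrt

# Hypothetical per-participant logs: which placeholder biases each participant
# identified, and their accuracy gain over the model. All values are invented.
BIASES = ["bias_a", "bias_b", "bias_c"]
logs = [
    {"detected": {"bias_a", "bias_b"}, "gain": 0.10},
    {"detected": {"bias_a"}, "gain": 0.04},
    {"detected": set(), "gain": 0.00},
    {"detected": {"bias_a", "bias_b", "bias_c"}, "gain": 0.15},
]


def detection_rates(logs, biases):
    """Share of participants who identified each bias."""
    n = len(logs)
    return {b: sum(b in p["detected"] for p in logs) / n for b in biases}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rates = detection_rates(logs, BIASES)
counts = [len(p["detected"]) for p in logs]
gains = [p["gain"] for p in logs]
r = pearson_r(counts, gains)

print(rates)
print(f"correlation between errors identified and accuracy gain: {r:.3f}")
```

A positive correlation in such an analysis would be consistent with error identification driving the gains, but it would not by itself establish causation.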
- Referee: Statistical reporting: The lack of detailed statistical reporting, such as exact p-values, effect sizes, confidence intervals, or power calculations for the observed outperformance over the model baseline, combined with the small N=42 pilot sample, reduces confidence in the preliminary conclusions and the ability to detect differences between assistance types.
  Authors: We acknowledge the need for more complete statistical reporting. The revised version will include exact p-values, Cohen’s d effect sizes, and 95% confidence intervals for the key comparisons (user vs. model accuracy and conversational vs. Q&A conditions). We will also add a post-hoc power analysis for the observed effects and a brief discussion of the pilot sample size (N=42) as a limitation that primarily affects detection of between-condition differences, consistent with the manuscript’s framing as a design-refinement study rather than a definitive test.
  Revision: yes
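The quantities named in this response can be sketched with the Python standard library. The accuracy values below are invented and the function names are illustrative. The confidence interval and the sample-size formula use a normal approximation, which is an assumption made here to avoid external dependencies; a t-based interval would be somewhat wider for a small pilot. The sample-size function is an a priori calculation, related to but not the same as the post-hoc power analysis the authors mention.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Two-sample Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def mean_ci_normal(xs, level=0.95):
    """Normal-approximation confidence interval for a mean (assumption: z, not t)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(xs) / sqrt(len(xs))
    m = mean(xs)
    return m - half_width, m + half_width


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n to detect effect size d in a two-sided comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical per-participant accuracies for the two conditions.
conversational = [0.70, 0.75, 0.80, 0.65, 0.72]
qa = [0.68, 0.74, 0.79, 0.66, 0.73]

d = cohens_d(conversational, qa)
low, high = mean_ci_normal(conversational)
needed = n_per_group(0.5)

print(f"Cohen's d: {d:.3f}")
print(f"95% CI for conversational mean: ({low:.3f}, {high:.3f})")
print(f"participants per group for d = 0.5: {needed}")
```

The sample-size function makes the referee's concern tangible: under these approximations, even a medium between-condition effect would need more participants per group than the pilot had in total.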
Circularity Check
No significant circularity in empirical experimental study
The paper describes an empirical user study (N=42) comparing conversational and Q&A XAI assistance on prediction accuracy, model understanding, and error identification using an explainable-by-design model. All central claims rest on direct experimental measurements of participant performance rather than any mathematical derivation, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems are invoked; results are presented as preliminary observations to inform a future full study. This is a standard self-contained empirical design with independent data collection.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions for statistical significance testing in between-subjects user experiments