Recognition: no theorem link
DQA: Diagnostic Question Answering for IT Support
Pith reviewed 2026-05-10 19:42 UTC · model grok-4.3
The pith
DQA raises IT support success from 41.3 percent to 78.7 percent by holding persistent diagnostic state and aggregating retrieved cases at the root-cause level instead of per document.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DQA maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. It combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. On 150 anonymized enterprise IT support scenarios, averaged over three runs, it achieves 78.7 percent success under a trajectory-level criterion compared with 41.3 percent for a multi-turn RAG baseline while reducing average turns from 8.4 to 3.9.
What carries the argument
Persistent diagnostic state together with root-cause-level retrieval aggregation, which lets evidence accumulate and hypotheses compete across turns.
If this is right
- Evidence from successive user messages can be retained and combined without being lost between turns.
- Similar past cases are grouped by shared root cause, allowing stronger hypothesis testing.
- Response generation is explicitly conditioned on the current accumulated state rather than the raw history.
- Fewer conversation turns are required on average to reach a correct diagnosis.
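To make the pattern above concrete, here is a minimal Python sketch of persistent diagnostic state with root-cause-level aggregation. The names are not taken from the paper: DiagnosticState, handle_turn, and the case fields root_cause and retrieval_score are hypothetical, and the paper's query rewriting and state-conditioned generation are reduced to stand-in callables.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DiagnosticState:
    """Hypothetical persistent state carried across conversation turns."""
    observed_evidence: list = field(default_factory=list)  # user-reported symptoms so far
    cause_scores: dict = field(default_factory=lambda: defaultdict(float))  # root cause -> accumulated support

    def update(self, new_evidence, retrieved_cases):
        """Fold one turn's evidence and retrieval results into the state."""
        self.observed_evidence.append(new_evidence)
        # Aggregate at the root-cause level rather than per document:
        # every retrieved case votes for its labeled root cause.
        for case in retrieved_cases:
            self.cause_scores[case["root_cause"]] += case["retrieval_score"]

    def ranked_hypotheses(self):
        """Competing root-cause hypotheses, strongest first."""
        return sorted(self.cause_scores.items(), key=lambda kv: kv[1], reverse=True)


def handle_turn(state, user_message, retrieve, generate):
    """One turn of the loop: rewrite, retrieve, update state, respond.

    retrieve() and generate() stand in for the paper's retrieval and
    state-conditioned generation; the rewrite below is a crude placeholder.
    """
    rewritten = f"{user_message} | prior evidence: {state.observed_evidence}"
    cases = retrieve(rewritten)
    state.update(user_message, cases)
    return generate(user_message, state.ranked_hypotheses())
```

Because the scores live in the state object rather than in the prompt history, evidence from earlier turns keeps counting toward each hypothesis even after the raw messages fall out of the context window.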
Where Pith is reading between the lines
- The same state-plus-root-cause pattern could be tested in other iterative diagnostic settings such as medical or network troubleshooting conversations.
- Production support platforms might see reduced human hand-offs if stateful aggregation replaces flat retrieval histories.
- Explicit state tracking appears to be the main missing piece in many current retrieval-augmented dialogue systems.
Load-bearing premise
The replay-based evaluation on 150 anonymized enterprise scenarios with the chosen trajectory-level success metric faithfully measures real-world diagnostic performance.
What would settle it
A side-by-side live deployment on actual incoming support tickets that records final resolution rates, escalation frequency, and user-reported satisfaction for DQA versus the baseline.
Original abstract
Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9.
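Read operationally, the abstract's replay protocol and trajectory-level criterion amount to a scoring loop along the following lines. This is a hedged sketch: the scenario fields (scripted_messages, gold_root_cause), the turn limit, and the equality check for success are assumptions, not the paper's actual harness.

```python
def run_replay_eval(system, scenarios, n_runs=3, max_turns=10):
    """Replay scripted scenarios; report mean trajectory-level success and mean turns.

    A trajectory counts as a success only if the system names the gold root cause
    within the turn limit (assumed operationalization of the trajectory-level criterion).
    """
    run_success, run_turns = [], []
    for _ in range(n_runs):
        successes, turns_used = 0, []
        for scenario in scenarios:
            diagnosis, turns = system.diagnose(scenario["scripted_messages"], max_turns=max_turns)
            turns_used.append(turns)
            if diagnosis == scenario["gold_root_cause"]:
                successes += 1
        run_success.append(successes / len(scenarios))
        run_turns.append(sum(turns_used) / len(turns_used))
    return sum(run_success) / n_runs, sum(run_turns) / n_runs
```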
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DQA, a diagnostic question-answering framework for enterprise IT support that maintains persistent diagnostic state, aggregates retrieved cases at the root-cause level (rather than individual documents), and combines conversational query rewriting with state-conditioned response generation. On a replay-based evaluation using 150 anonymized enterprise scenarios, DQA reports an average 78.7% trajectory-level success rate (vs. 41.3% for a multi-turn RAG baseline) and reduces average turns from 8.4 to 3.9, averaged over three independent runs.
Significance. If the evaluation protocol and attribution of gains to state persistence and root-cause aggregation can be substantiated, the work would offer a practical advance in multi-turn LLM systems for diagnostic troubleshooting, where standard RAG often fails to accumulate evidence across turns. The numerical gap and turn reduction are potentially impactful for latency-sensitive enterprise applications, though the replay setup limits claims about generalization to live interactions.
Major comments (2)
- [Evaluation (replay protocol and success metric)] The central empirical claim (78.7% vs 41.3% success and turn reduction) rests on the replay-based protocol with trajectory-level success faithfully measuring diagnostic performance and attributing gains to persistent state plus root-cause aggregation. However, the manuscript provides no ablations isolating these components from others (e.g., query rewriting), and the abstract does not define how user responses are scripted or how success is operationalized when scenarios contain ambiguity or contradictions.
- [Evaluation and Experimental Setup] Dataset and baseline details are insufficient to assess the result: the 150 anonymized scenarios are not characterized (e.g., distribution of root causes, ambiguity levels), the multi-turn RAG baseline implementation is not specified, and no statistical tests (variance across the three runs, significance of the 37.4-point gap) are reported.
Minor comments (2)
- Clarify what varies across the three independent runs (e.g., retrieval stochasticity, generation sampling temperature) to allow readers to interpret the reported averages.
- Add a diagram or table contrasting DQA's state machine and aggregation step against the baseline RAG pipeline for immediate readability.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the emphasis on clarifying the evaluation protocol and experimental details. We will revise the paper to incorporate ablations, expanded protocol definitions, baseline specifications, summary dataset statistics, and statistical reporting. Our point-by-point responses follow.
Point-by-point responses
- Referee: [Evaluation (replay protocol and success metric)] The central empirical claim (78.7% vs 41.3% success and turn reduction) rests on the replay-based protocol with trajectory-level success faithfully measuring diagnostic performance and attributing gains to persistent state plus root-cause aggregation. However, the manuscript provides no ablations isolating these components from others (e.g., query rewriting), and the abstract does not define how user responses are scripted or how success is operationalized when scenarios contain ambiguity or contradictions.
Authors: We agree that ablations are needed to isolate the effects of persistent state and root-cause aggregation. In the revision we will add experiments that disable each component independently while retaining query rewriting and other elements. We will also expand the Experimental Setup section with a precise description of the replay protocol, including how user responses are scripted from the scenarios and the exact criteria for trajectory-level success (correct root-cause identification within the turn limit). This will explicitly address handling of ambiguity and contradictions. The abstract will be updated to point readers to this definition in the main text. revision: yes
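As a reading of this plan, the component ablations could be expressed as a small configuration grid that toggles persistent state and root-cause aggregation independently while the rest of the pipeline stays fixed. The configuration names, flags, and helper functions below are illustrative assumptions, not the authors' actual setup.

```python
# Hypothetical ablation grid: each configuration disables one or both of the
# two load-bearing components while keeping query rewriting and retrieval fixed.
ABLATIONS = {
    "full":              {"persistent_state": True,  "root_cause_aggregation": True},
    "no_state":          {"persistent_state": False, "root_cause_aggregation": True},
    "no_aggregation":    {"persistent_state": True,  "root_cause_aggregation": False},
    "flat_rag_baseline": {"persistent_state": False, "root_cause_aggregation": False},
}


def run_ablations(build_system, scenarios, evaluate):
    """Evaluate every configuration with the same replay protocol (sketch)."""
    return {name: evaluate(build_system(**flags), scenarios) for name, flags in ABLATIONS.items()}
```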
- Referee: [Evaluation and Experimental Setup] Dataset and baseline details are insufficient to assess the result: the 150 anonymized scenarios are not characterized (e.g., distribution of root causes, ambiguity levels), the multi-turn RAG baseline implementation is not specified, and no statistical tests (variance across the three runs, significance of the 37.4-point gap) are reported.
Authors: We will add a detailed description of the multi-turn RAG baseline, specifying its query-rewriting and retrieval components. For the dataset, privacy constraints prevent releasing individual scenarios, but we will include summary statistics on root-cause category distribution and indicators of scenario complexity. We will also report per-run variance and apply a statistical significance test to the observed gaps. These additions will appear in the revised Experimental Setup and Results sections. revision: partial
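The abstract does not say which test will be used; one option consistent with three runs over a fixed scenario set is a paired bootstrap over the 150 scenarios, sketched below. The per-scenario 0/1 outcome vectors and the 95% interval are assumptions of this sketch, not the authors' stated analysis.

```python
import random


def bootstrap_gap_ci(dqa_outcomes, baseline_outcomes, n_boot=10_000, seed=0):
    """Paired bootstrap 95% CI for the success-rate gap over the same scenarios.

    dqa_outcomes / baseline_outcomes: per-scenario 0/1 success indicators, aligned by scenario.
    """
    rng = random.Random(seed)
    n = len(dqa_outcomes)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample scenarios with replacement
        dqa_rate = sum(dqa_outcomes[i] for i in idx) / n
        base_rate = sum(baseline_outcomes[i] for i in idx) / n
        gaps.append(dqa_rate - base_rate)
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]  # (lower, upper) bound on the gap
```

If the interval excludes zero, the 37.4-point gap is unlikely to be an artifact of which scenarios happened to be sampled, though run-to-run generation variance would still need to be reported separately.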
Circularity Check
No circularity: purely empirical framework evaluation
Full rationale
The paper introduces DQA as a diagnostic QA framework with persistent state and root-cause aggregation, then reports direct empirical results from a replay-based evaluation on 150 fixed scenarios (78.7% success vs 41.3% for the baseline, 3.9 vs 8.4 turns). There is no derivation chain, no equations, no fitted parameters renamed as predictions, and no load-bearing self-citation steps. The reported metrics are measured outcomes of the described protocol, not outputs forced by construction from the inputs or from the authors' prior work. The evaluation protocol itself is a methodological choice whose external validity is a separate question from internal circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [3] Agnar Aamodt and Enric Plaza. 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1):39–59. https://doi.org/10.3233/AIC-1994-7104
- [4] Yiruo Cheng, Xinzhe Wu, Yiwei Zhang, Wenxuan Feng, Minhao Chen, Qianqian Liu, and Ming Yin. 2025. CORAL: Benchmarking multi-turn conversational retrieval-augmented generation. In Findings of the Association for Computational Linguistics: NAACL 2025. https://doi.org/10.18653/v1/2025.findings-naacl.72
- [5] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184. https://doi.org/10.18653/v1/D18-1241
- [6] Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Jianfeng Chen, and Avirup Sil. 2025. MTRAG: A multi-turn conversational benchmark for evaluating retrieval-augmented generation systems. arXiv preprint arXiv:2501.03468. https://doi.org/10.48550/arXiv.2501.03468
- [7] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems. https://doi.org/10.48550/arXiv.2005.11401
- [8] Hongjin Qian and Zhicheng Dou. 2022. Explicit query rewriting for conversational dense retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4054–4064. https://doi.org/10.18653/v1/2022.findings-emnlp.355
- [9] Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266. https://doi.org/10.1162/tacl_a_00266
- [10] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3982–3992. https://doi.org/10.18653/v1/D19-1410
- [11] Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha. 2021. A comparison of question rewriting methods for conversational passage retrieval. In Advances in Information Retrieval: 43rd European Conference on IR Research, pages 418–424. Springer. https://doi.org/10.1007/978-3-030-71809-9_34
- [12] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2210.03629
- [13] Shi Yu, Jiahua Liu, Jingqin Yang, Chenguang Shen, Linjun Wang, Kun Liu, and Zhenglu Zhang. 2020. Few-shot generative conversational query rewriting. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1933–1936. https://doi.org/10.1145/3397271.3401161
- [14] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36. https://doi.org/10.48550/arXiv.2306.05685