A Proactive EMR Assistant for Doctor-Patient Dialogue: Streaming ASR, Belief Stabilization, and Preliminary Controlled Evaluation

Jia You; Yan Liu; Zhenhai Pan

arxiv: 2604.13059 · v1 · submitted 2026-03-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-15 10:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: proactive EMR assistant · streaming ASR · belief stabilization · doctor-patient dialogue · medical state extraction · controlled pilot evaluation · dialogue system · information retrieval

The pith

A proactive EMR assistant stabilizes diagnostic beliefs from streaming speech to support doctor-patient dialogues in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an end-to-end system that processes live doctor-patient speech rather than waiting until after the visit to build notes. It combines streaming automatic speech recognition, punctuation restoration, state extraction, belief stabilization to keep diagnostic hypotheses consistent, objectified retrieval of relevant medical facts, and action planning to generate reports on the fly. In a controlled pilot using ten streamed simulated dialogues plus a 300-query benchmark, the complete pipeline reached 0.84 state-event F1, 0.87 retrieval Recall@5, 83.3 percent coverage, 81.4 percent structural completeness, and 80.0 percent risk recall. Ablations point to punctuation restoration and belief stabilization as helpful for the downstream steps. The authors frame the work strictly as a technical demonstration under simulated conditions, not as evidence for clinical use.

Core claim

The central claim is that an online architecture integrating streaming ASR, punctuation restoration, stateful extraction, belief stabilization, objectified retrieval, and action planning can produce a proactive EMR assistant. In the preliminary controlled evaluation on ten streamed doctor-patient dialogues and the aggregated 300-query benchmark, the full system attains a state-event F1 of 0.84, retrieval Recall@5 of 0.87, and end-to-end pilot scores of 83.3 percent coverage, 81.4 percent structural completeness, and 80.0 percent risk recall. Component ablations indicate that adding punctuation restoration and belief stabilization improves extraction quality, retrieval accuracy, and action selection.
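For readers unfamiliar with the two headline metrics, the sketch below shows one common way to compute them. It is an illustration only: the paper's exact definitions are not given on this page, so the set-based matching of state events, the per-query form of Recall@5, and the names `f1_score` and `recall_at_k` are all assumptions.

```python
from typing import Hashable, Iterable, Sequence, Set


def f1_score(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Set-based F1 between predicted and gold state events (assumed exact match)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        # Covers empty inputs and the no-overlap case.
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 5) -> float:
    """Fraction of the relevant items that appear among the top-k ranked results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


# Toy inputs, not data from the paper.
toy_pred = [("symptom", "cough"), ("symptom", "fever")]
toy_gold = [("symptom", "cough"), ("duration", "3 days")]
toy_f1 = f1_score(toy_pred, toy_gold)
toy_recall = recall_at_k(["a", "b", "c", "d", "e", "f"], {"a", "f"}, k=5)
```

A benchmark-level Recall@5 would then be the mean of `recall_at_k` over all queries. Some benchmarks instead report a hit rate (whether any relevant item lands in the top five), and the two can differ noticeably when queries have several relevant items.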

What carries the argument

Belief stabilization, which incrementally updates and maintains consistent diagnostic state hypotheses across noisy streaming speech input instead of resetting on each turn.
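The idea of updating hypotheses incrementally rather than resetting each turn can be sketched as follows. This is not the authors' method, which is not specified on this page: the exponential smoothing, the two-threshold hysteresis, the class name `BeliefStabilizer`, and every parameter value are assumptions chosen only to illustrate the general technique.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class BeliefStabilizer:
    """Hypothetical stabilizer: smoothed scores plus commit/drop hysteresis."""
    alpha: float = 0.5         # weight given to the newest turn's evidence (assumed)
    commit_above: float = 0.6  # score at which a hypothesis becomes active (assumed)
    drop_below: float = 0.3    # score at which an active hypothesis is dropped (assumed)
    scores: Dict[str, float] = field(default_factory=dict)
    active: Set[str] = field(default_factory=set)

    def update(self, evidence: Dict[str, float]) -> Set[str]:
        """Fold one turn's per-hypothesis evidence (0..1) into the running state."""
        # Hypotheses not mentioned this turn keep their score: no per-turn reset.
        for hyp, value in evidence.items():
            prev = self.scores.get(hyp, 0.0)
            self.scores[hyp] = (1 - self.alpha) * prev + self.alpha * value
        # Hysteresis: scores between the two thresholds leave membership unchanged.
        for hyp, score in self.scores.items():
            if score >= self.commit_above:
                self.active.add(hyp)
            elif score <= self.drop_below:
                self.active.discard(hyp)
        return set(self.active)


# Toy evidence stream, not data from the paper.
stabilizer = BeliefStabilizer()
history = [
    stabilizer.update({"pneumonia": 0.9}),  # smoothed 0.45: not yet committed
    stabilizer.update({"pneumonia": 0.9}),  # smoothed 0.675: committed
    stabilizer.update({"pneumonia": 0.2}),  # smoothed 0.4375: one noisy turn, still active
    stabilizer.update({"pneumonia": 0.0}),  # smoothed 0.21875: dropped
]
```

The gap between `commit_above` and `drop_below` is what keeps a single noisy ASR turn from flipping a hypothesis on or off. A narrower gap reacts faster but is less stable.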

If this is right

Punctuation restoration raises the quality of state extraction from speech input.
Belief stabilization improves retrieval accuracy and downstream action planning.
The integrated pipeline can generate replayable reports while the consultation is still underway.
End-to-end metrics exceed 80 percent on coverage, structural completeness, and risk recall under the pilot conditions.
Ablation results show directional gains attributable to the added stabilization and punctuation modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the architecture scales, it could shorten the time doctors spend on documentation after visits.
Real deployment would require testing how the system interacts with existing hospital EMR databases during live sessions.
The emphasis on streaming handling suggests similar pipelines could be adapted for other real-time medical dialogue tasks such as triage or follow-up scheduling.

Load-bearing premise

Performance measured on ten simulated streamed dialogues and a 300-query benchmark will generalize beyond this tightly controlled pilot setting.

What would settle it

Substantially lower state-event F1 or retrieval Recall@5 when the same system is run on recordings of real, unscripted doctor-patient conversations collected in an actual clinic.

Figures

Figures reproduced from arXiv: 2604.13059 by Jia You, Yan Liu, Zhenhai Pan.

Original abstract

Most dialogue-based electronic medical record (EMR) systems still behave as passive pipelines: transcribe speech, extract information, and generate the final note after the consultation. That design improves documentation efficiency, but it is insufficient for proactive consultation support because it does not explicitly address streaming speech noise, missing punctuation, unstable diagnostic belief, objectification quality, or measurable next-action gains. We present an end-to-end proactive EMR assistant built around streaming speech recognition, punctuation restoration, stateful extraction, belief stabilization, objectified retrieval, action planning, and replayable report generation. The system is evaluated in a preliminary controlled setting using ten streamed doctor-patient dialogues and a 300-query retrieval benchmark aggregated across dialogues. The full system reaches state-event F1 of 0.84, retrieval Recall@5 of 0.87, and end-to-end pilot scores of 83.3% coverage, 81.4% structural completeness, and 80.0% risk recall. Ablations further suggest that punctuation restoration and belief stabilization may improve downstream extraction, retrieval, and action selection within this pilot. These results were obtained under a controlled simulated pilot setting rather than broad deployment claims, and they should not be read as evidence of clinical deployment readiness, clinical safety, or real-world clinical utility. Instead, they suggest that the proposed online architecture may be technically coherent and directionally supportive under tightly controlled pilot conditions. The present study should be read as a pilot concept demonstration under tightly controlled pilot conditions rather than as evidence of clinical deployment readiness or clinical generalizability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A workable small-scale streaming EMR pipeline with honest scoping, but too limited by simulation and sample size to support much beyond a proof of concept.

The main takeaway is a working prototype for a proactive EMR assistant that processes doctor-patient speech in real time, using streaming ASR, punctuation restoration, and belief stabilization to track the consultation state and suggest actions. The system also includes objectified retrieval and replayable report generation.

What stands out is the end-to-end online design. Most systems wait until the end of the visit to build the note, whereas this one tries to stabilize beliefs and plan next steps while the conversation is still happening. The results on the 300-query benchmark and the ten dialogues show decent performance, with state-event F1 at 0.84 and retrieval Recall@5 at 0.87. The ablations give some indication that the stabilization component helps extraction and retrieval in this setting.

The evaluation stays within its limits. Everything comes from simulated dialogues, and ten is a small number, so there is no evidence on how the system performs with real patients or under varied conditions such as different accents or interruptions. The authors note this clearly, so the claims stay modest and focused on technical coherence.

This work suits researchers in medical dialogue systems who want to see how to make these tools more responsive during the visit. It could serve as a reference for combining these NLP components in a streaming setup, especially for someone starting to build a similar prototype. The paper is coherent enough to warrant peer review. Referees could push for larger-scale tests or real data to strengthen the findings.

Referee Report

0 major / 2 minor

Summary. The paper presents a proactive EMR assistant system integrating streaming ASR, punctuation restoration, stateful extraction, belief stabilization, objectified retrieval, action planning, and replayable report generation. It reports results from a preliminary controlled pilot on ten streamed simulated doctor-patient dialogues plus a 300-query retrieval benchmark, with the full system achieving state-event F1 of 0.84, retrieval Recall@5 of 0.87, and end-to-end pilot scores of 83.3% coverage, 81.4% structural completeness, and 80.0% risk recall. Ablations are described as suggesting benefits from punctuation restoration and belief stabilization. All claims are explicitly scoped to technical coherence and directional support under tightly controlled simulated conditions, with no assertions of clinical safety, deployment readiness, or generalizability.

Significance. If the results hold, the work demonstrates a technically coherent online architecture capable of handling streaming speech noise and unstable diagnostic beliefs in a simulated medical dialogue setting. The component-wise ablations provide concrete, if preliminary, evidence that specific modules can improve downstream extraction, retrieval, and action selection within the pilot. This supplies a useful proof-of-concept baseline for proactive rather than passive EMR systems, though the small evaluation scale restricts claims of broader significance.

minor comments (2)

[Abstract / Results] Abstract and results sections: the ablation suggestions are described qualitatively ('may improve downstream extraction, retrieval, and action selection'); include a table or explicit delta metrics (e.g., F1 or Recall@5 with/without each component) so readers can judge the magnitude of the reported gains.
[Evaluation] Evaluation section: the ten dialogues are described as 'streamed simulated'; add a short paragraph on dialogue generation protocol, speaker characteristics, and any controls for realism so the pilot can be reproduced or extended.
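The delta table requested in the first comment could be produced with a helper along these lines. This is a hypothetical sketch: the function name `ablation_deltas`, the variant and metric keys, and the numbers in the usage example are placeholders, not values reported in the paper.

```python
from typing import Dict


def ablation_deltas(
    full: Dict[str, float],
    ablated: Dict[str, Dict[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Return full-system minus ablated-variant score for each shared metric."""
    deltas: Dict[str, Dict[str, float]] = {}
    for variant, metrics in ablated.items():
        deltas[variant] = {
            name: round(full[name] - value, 4)
            for name, value in metrics.items()
            if name in full  # skip metrics the full system did not report
        }
    return deltas


# Placeholder numbers for illustration only; NOT results from the paper.
placeholder_full = {"state_event_f1": 0.75, "recall_at_5": 0.5}
placeholder_ablated = {
    "no_punctuation": {"state_event_f1": 0.5, "recall_at_5": 0.25},
    "no_stabilization": {"state_event_f1": 0.25, "recall_at_5": 0.5},
}
delta_table = ablation_deltas(placeholder_full, placeholder_ablated)
```

A positive delta means the removed component was contributing to that metric. With only ten dialogues, such deltas would still need confidence intervals before the size of any gain could be judged.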

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the accurate summary, positive significance assessment, and recommendation for minor revision. The report correctly notes the preliminary scale of the evaluation. We address the key points below.

Referee: The small evaluation scale restricts claims of broader significance.

Authors: We agree that the pilot is limited to ten streamed simulated dialogues plus a 300-query benchmark. The manuscript already qualifies every result as preliminary, scoped exclusively to technical coherence and directional support under tightly controlled simulated conditions, with explicit disclaimers against clinical safety, deployment readiness, or generalizability claims. The ablations are presented only as suggestive within this setting. No revision is required because the scoping language is already present and matches the referee's characterization.

revision: no

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper reports empirical metrics (state-event F1 0.84, Recall@5 0.87, pilot scores ~80-83%) obtained by direct measurement on ten streamed simulated dialogues and a 300-query benchmark. No equations, derivations, or parameter fits are described that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The text explicitly limits scope to the controlled pilot and states results demonstrate only technical coherence under those conditions. No load-bearing self-citations, uniqueness theorems, or ansatzes appear. This is a standard empirical system evaluation whose central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard NLP and retrieval techniques without introducing new free parameters or postulated entities.

axioms (1)

domain assumption: Existing streaming ASR and information extraction models perform adequately when applied to medical dialogues.
The system architecture assumes off-the-shelf components integrate and function as described in the medical context.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Sabine Molenaar, Lientje Maas, Veronica Burriel, Fabiano Dalpiaz, and Sjaak Brinkkemper. Medical dialogue summarization for automated reporting in healthcare. In Advanced Information Systems Engineering Workshops, pages 76–88, 2020.

[2] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, and Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. In Proceedings of EACL, pages 2291–2302, 2023.

[3] Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, and Meliha Yetisgen. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization and generation of doctor-patient conversations. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 503–513, 2023.

[4] Ottokar Tilk and Tanel Alumae. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In Proceedings of Interspeech, pages 3047–3051, 2016.

[5] Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, and Katrin Kirchhoff. Robust prediction of punctuation and truecasing for medical ASR. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 53–62, 2020.

[6] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of ICML, pages 1321–1330, 2017.

[7] Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023.

[8] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktaschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020.

[9] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of ACM Multimedia, pages 4083–4091, 2022.

[10] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free document understanding transformer. In Proceedings of ECCV, pages 498–517, 2022.

[11] Brandon Smock, Rohith Pesala, and Robin Abraham. PubTables-1M: Towards comprehensive table extraction from unstructured documents. In Proceedings of CVPR, pages 4634–4642, 2022.

[12] Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. arXiv preprint arXiv:2206.03001, 2022.

[13] Zhenhai Pan, Yan Liu, and Jia You. Proactive knowledge inquiry in doctor-patient dialogue: Stateful extraction, belief updating, and path-aware action planning. Companion methods manuscript, submitted to arXiv, 2026.
