pith. machine review for the scientific record. sign in

arxiv: 2604.20022 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI· cs.CL

Recognition: unknown

Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords Bayesian inferencemodular AImedical diagnosislarge language modelsselective classificationadversarial robustnessprivacy-preserving AI
0
0 comments X

The pith

Separating language from reasoning in medical AI allows a Bayesian engine to deliver calibrated diagnosis and beat larger standalone models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models struggle as full diagnostic agents because they blend conversational abilities with statistical judgment in ways that cannot be fixed by scale alone. The paper presents a framework that uses the LLM only to interpret patient statements as evidence and to phrase questions, leaving all probability calculations to a separate, transparent Bayesian engine. This design keeps conversations private since no patient data reaches the LLM, permits swapping the reasoning module for different populations, and produces an adjustable balance between diagnostic accuracy and how many cases it covers. Experiments show that this hybrid setup creates a performance gap where even basic language sensors paired with the engine surpass advanced all-in-one models while using less computation and resisting manipulative inputs. The gains are shown to stem from the modular structure rather than superior knowledge.

Core claim

By restricting the LLM to the role of a language sensor that extracts structured evidence from patient utterances without performing any inference, and delegating all diagnostic reasoning to a deterministic Bayesian engine, the system achieves calibrated selective diagnosis, a statistical separation where low-cost components outperform integrated frontier models, and robustness against adversarial communication that defeats standalone LLMs.

What carries the argument

The Bayesian Medical Belief Engine, a module that receives structured evidence from the LLM sensor and computes posterior probabilities for diagnoses using explicit priors and likelihoods.

If this is right

  • Calibrated selective diagnosis with continuously adjustable accuracy-coverage tradeoff.
  • Even a cheap sensor paired with the engine outperforms a frontier standalone model at a fraction of the cost.
  • Robustness to adversarial patient communication styles that cause standalone models to collapse.
  • Patient data never enters the LLM, ensuring privacy by construction.
  • The statistical backend can be replaced for different target populations without retraining the language component.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This modular design suggests that similar separations could improve reliability in other high-stakes domains requiring both interpretation and precise calculation.
  • Replacing the Bayesian engine with other statistical models might extend the benefits to non-medical applications like risk assessment.

Load-bearing premise

An LLM sensor can convert natural language patient statements into accurate structured evidence without systematic biases that the Bayesian engine is unable to compensate for.

What would settle it

Demonstrating cases where the LLM consistently misinterprets patient descriptions of symptoms in a way that leads the Bayesian engine to lower accuracy than a comparable standalone LLM on the same dataset.

Figures

Figures reproduced from arXiv: 2604.20022 by Akhil Arora, Alexandra Kulinkina, David Sasu, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Lars Klein, Mary-Anne Hartley, Yusuf Kesmen.

Figure 1
Figure 1. Figure 1: (a) Three paradigms for LLM-based diagnostic dialogue. Standalone: the LLM handles all reasoning, questioning, and diagnosis internally. LLM Bayesian: an external module computes EIG from LLM-derived posteriors, principled question selection, but no grounded knowledge base. BMBE (ours): the LLM serves only as a sensor; all diagnostic reasoning is performed by a deterministic Bayesian engine grounded in an … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the BMBE architecture. The LLM layer handles only language: parsing [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Left: DHS vs. API cost per token. Right: DHS vs. estimated cost per patient. In both views, BMBE sensors (circles) achieve higher DHS than standalone doctors (squares) at 10–18× lower cost. 60 70 80 90 100 Coverage (%) 40 50 60 70 80 90 100 Selective Accuracy (%) default GPT-5.4 Gemini 3.1 Pro GPT-OSS-120B Llama-4-Maverick Qwen 3.6+ Kimi K2.5 Triage (τ→0) Balanced (τ=0.50) Safety-critical (τ=0.90) BMBE + G… view at source ↗
Figure 4
Figure 4. Figure 4: Operating point control. The green curve shows the accuracy, coverage frontier of BMBE + [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Left: DDXPlus prior distribution sorted by prevalence. The long tail (max / min ≈ 200×) reflects real-world disease frequency; the dashed line shows the uniform baseline. Right: Distribution of positive evidence count per evaluation patient across KBs. two variants: one using GPT-5.4 and one using Gemini 3.1, enabling a cross-model comparison of zero-shot medical knowledge. Construction proceeds in two sta… view at source ↗
Figure 6
Figure 6. Figure 6: Left: Distribution of LLM-elicited binary likelihoods P(yes | d) for both GPT and Gemini KBs; the strong left skew indicates that most disease–feature associations are weak. Right: CDF of per-pair KL divergence from uniform across all three KBs; DDXPlus (empirical) has the highest informativeness, while both LLM-KBs are comparable despite being synthetically generated. agreement on medical knowledge. GPT g… view at source ↗
Figure 7
Figure 7. Figure 7: Left: Scatter plot of P(yes | d) for 45 shared features across 18 diseases (n=810 pairs); dashed line is perfect agreement. Right: Distribution of pairwise likelihood differences; the left￾skewed distribution (mean = −0.055) confirms Gemini’s systematically higher assignments. 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Variance of P(yes ∣ d) across diseases 0 5 10 15 20 25 30 Number of features GPT-5.4 Gemini 2.5 … view at source ↗
Figure 8
Figure 8. Figure 8: Feature discriminativeness (left: cross-disease variance; right: cross-disease range) for [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: DDXPlus selective accuracy vs. coverage ( [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: DHS across patient personas. Shaded areas show degradation from the plain baseline [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Top-1 accuracy vs. KB size K. BMBE remains stable across a 4× increase in disease space; the standalone doctor is flat re￾gardless of K [PITH_FULL_IMAGE:figures/full_fig_p031_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: illustrates the engine’s belief dynamics. The left panel tracks the posterior of the ground￾truth disease across turns for a representative case: competing hypotheses rise and fall as evidence accumulates. The right panel aggregates entropy trajectories across all sessions, separating correct and incorrect diagnoses: correct cases exhibit steady entropy collapse, while incorrect cases plateau at elevated … view at source ↗
read the original abstract

Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces BMBE, a modular medical dialogue framework that restricts an LLM to the role of a sensor (parsing utterances into structured evidence and verbalizing questions) while confining all diagnostic inference to a separate deterministic Bayesian engine. It claims this architectural separation delivers three properties unavailable to autonomous LLMs—calibrated selective diagnosis with an adjustable accuracy-coverage tradeoff, a statistical separation gap in which even a cheap sensor plus the engine outperforms a frontier model from the same family, and robustness to adversarial patient communication styles—while also ensuring privacy by construction and modularity across knowledge bases. The abstract asserts validation on both empirical and LLM-generated knowledge bases confirming that the advantage is architectural rather than informational.

Significance. If the empirical claims are substantiated with rigorous metrics, the modular separation of language and probabilistic reasoning offers a concrete, auditable alternative to end-to-end LLM diagnostic agents. The emphasis on replaceable statistical back-ends, privacy guarantees, and continuously tunable calibration could influence the design of reliable medical AI systems, particularly where cost, auditability, and robustness to input variation matter.

major comments (3)
  1. [Abstract] Abstract: The assertion of validation across knowledge bases and outperformance over frontier LLMs is made without any metrics, experimental protocols, error bars, or result tables; this absence prevents assessment of the claimed statistical separation gap and calibrated selective diagnosis.
  2. [Bayesian engine description] Bayesian engine description (likely §3 or §4): No equations or explicit likelihood definitions are supplied for how the engine incorporates evidence from the LLM sensor or models sensor noise; without these, it is impossible to verify how posterior calibration or correction of extraction errors is achieved.
  3. [Validation section] Validation and robustness claims: The central properties (separation gap and adversarial robustness) rest on the untested premise that LLM sensor errors are either negligible or fully correctable by the Bayesian update; the manuscript provides neither error bounds on the sensor mapping nor an explicit noise model in the likelihoods, which directly undermines the architectural-advantage argument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough and insightful comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below, indicating the changes we will implement in the revised version.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion of validation across knowledge bases and outperformance over frontier LLMs is made without any metrics, experimental protocols, error bars, or result tables; this absence prevents assessment of the claimed statistical separation gap and calibrated selective diagnosis.

    Authors: We agree that the abstract, as currently written, is too high-level and does not include quantitative details. In the revision, we will update the abstract to incorporate key metrics from our experiments, including the observed performance gaps (e.g., accuracy improvements and cost reductions), calibration error measures, and robustness statistics under adversarial conditions. We will also reference the experimental protocols and tables/figures where these results are presented in detail. This will enable immediate assessment of the claims. revision: yes

  2. Referee: [Bayesian engine description] Bayesian engine description (likely §3 or §4): No equations or explicit likelihood definitions are supplied for how the engine incorporates evidence from the LLM sensor or models sensor noise; without these, it is impossible to verify how posterior calibration or correction of extraction errors is achieved.

    Authors: Thank you for pointing this out. While Section 3 provides a high-level description of the Bayesian engine, we acknowledge the absence of explicit mathematical formulations. We will add the necessary equations in a new subsection, defining the likelihood function that maps sensor outputs to evidence probabilities, incorporating a noise model to account for potential LLM extraction errors (e.g., via beta-distributed or Bernoulli noise parameters). This will explicitly show how the posterior updates correct for sensor inaccuracies and achieve calibration. The revised manuscript will include these details to allow full verification. revision: yes

  3. Referee: [Validation section] Validation and robustness claims: The central properties (separation gap and adversarial robustness) rest on the untested premise that LLM sensor errors are either negligible or fully correctable by the Bayesian update; the manuscript provides neither error bounds on the sensor mapping nor an explicit noise model in the likelihoods, which directly undermines the architectural-advantage argument.

    Authors: We appreciate the referee's concern regarding the foundational assumptions. Our experiments do demonstrate the separation gap and robustness across multiple knowledge bases, but we agree that without quantified sensor error analysis and an explicit noise model, the explanation of how the Bayesian engine corrects errors remains incomplete. In the revision, we will include sensor error rate measurements from our LLM sensor evaluations, introduce an explicit noise model in the likelihood definitions (as noted in the response to the previous comment), and provide error bounds. We will also add ablation studies showing the impact of sensor errors on the final posteriors. This will substantiate that the architectural separation enables correction and robustness. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on architectural separation, not derivations or self-referential fits

full rationale

The paper presents BMBE as an architectural framework enforcing LLM-as-sensor and standalone Bayesian engine, claiming three properties (calibrated selective diagnosis, statistical separation gap, adversarial robustness) follow directly from this separation. No equations, parameter fits, or derivation chains are exhibited in the provided text that reduce these properties to inputs by construction. Validation uses empirical/LLM-generated knowledge bases against frontier models, but the advantage is explicitly framed as architectural rather than informational or fitted. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. This is the common honest case of a self-contained architectural argument with no reduction to its own fitted values or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review; full paper would be needed to enumerate all free parameters and axioms. The separation of sensor and engine is treated as an unproven domain assumption.

axioms (1)
  • domain assumption An LLM can function as a pure sensor that extracts structured evidence without introducing uncorrectable bias into downstream Bayesian inference
    Invoked by the claim that patient data never enters the LLM and that the engine alone performs inference.
invented entities (1)
  • Bayesian Medical Belief Engine (BMBE) no independent evidence
    purpose: Standalone deterministic module that performs all diagnostic probabilistic reasoning
    New named framework introduced to enforce the language-reasoning separation.

pith-pipeline@v0.9.0 · 5531 in / 1403 out tokens · 36173 ms · 2026-05-10T02:24:33.074233+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Shortliffe.Computer-Based Medical Consultations: MYCIN

    Edward H. Shortliffe.Computer-Based Medical Consultations: MYCIN. Elsevier, 1976

  2. [2]

    F. T. de Dombal, D. J. Leaper, J. R. Staniland, A. P. McCann, and Jane C. Horrocks. Computer- aided diagnosis of acute abdominal pain.British Medical Journal, 2(5804):9–13, 1972

  3. [3]

    Miller, Harry E

    Randolph A. Miller, Harry E. Pople, and Jack D. Myers. Internist-I, an experimental computer- based diagnostic consultant for general internal medicine.New England Journal of Medicine, 307(8):468–476, 1982

  4. [4]

    Octo Barnett, James J

    G. Octo Barnett, James J. Cimino, Jon A. Hupp, and Edward P. Hoffer. DXplain: An evolving diagnostic decision-support system.JAMA, 258(1):67–74, 1987

  5. [5]

    Heckerman, Eric J

    David E. Heckerman, Eric J. Horvitz, and Bharat N. Nathwani. Toward normative expert systems: Part I. The Pathfinder project.Methods of Information in Medicine, 31(2):90–105, 1992

  6. [6]

    Greek Oracle

    Randolph A. Miller and Fred E. Masarie. The demise of the “Greek Oracle” model for medical diagnostic systems.Methods of Information in Medicine, 29(01):1–2, 1990

  7. [7]

    Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Towards expert-level medical question answering with large language models.Nature Medicine, 2025

  8. [8]

    Capabilities of Gemini Models in Medicine

    Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, et al. Capabilities of Gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

  9. [9]

    Towards conversational diagnostic AI

    Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tober, et al. Towards conversational diagnostic AI. Nature, 2025

  10. [10]

    BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

    Subhajit Choudhury, Sinead Williamson, Omar Rivasplata, and Tom Rainforth. BED-LLM: Intelligent information gathering with LLMs and Bayesian experimental design.arXiv preprint arXiv:2508.21184, 2025

  11. [11]

    DeLLMa: Decision making under uncertainty with large language models

    Ollie Liu, Deqing Fu, Dani Levy, Maryam Fazel, Adith Swaminathan, and Willie Neiswanger. DeLLMa: Decision making under uncertainty with large language models. InProceedings of the International Conference on Learning Representations (ICLR), 2025. Spotlight. 12

  12. [12]

    BIRD: A trustworthy Bayesian inference framework for large language models.arXiv preprint arXiv:2404.12494, 2024

    Yu Feng, Ben Zhou, Weidong Lin, and Dan Roth. BIRD: A trustworthy Bayesian inference framework for large language models.arXiv preprint arXiv:2404.12494, 2024

  13. [13]

    Ask patients with patience: Enabling LLMs for human-centric medical dialogue with grounded reasoning

    Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Fenglin Liu, and Junde Wu. Ask patients with patience: Enabling LLMs for human-centric medical dialogue with grounded reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2846–2857, 2025

  14. [14]

    Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov

    Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S. Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. MediQ: Question-asking LLMs and a benchmark for reliable interactive clinical reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  15. [15]

    Fine-tuning large language models with medical data: Can safety be ensured?NEJM AI, 2(1), 2025

    Minkyoung Kim, Yunha Kim, Hee Jun Kang, Hyeram Seo, Heejung Choi, JiYe Han, Gaeun Kee, Seohyun Park, Soyoung Ko, Hyoje Jung, Byeolhee Kim, Tae Joon Jun, and Young-Hak Kim. Fine-tuning large language models with medical data: Can safety be ensured?NEJM AI, 2(1), 2025. doi: 10.1056/AIcs2400390

  16. [16]

    Yu, Lawrence M

    Victor L. Yu, Lawrence M. Fagan, Sharon M. Wraith, William J. Clancey, A. Carlisle Scott, John Hannigan, Robert L. Blum, Bruce G. Buchanan, and Stanley N. Cohen. Antimicrobial selection by a computer: A blinded evaluation by infectious diseases experts.JAMA, 242(12): 1279–1282, 1979

  17. [17]

    Weiss, Casimir A

    Sholom M. Weiss, Casimir A. Kulikowski, Saul Amarel, and Aran Safir. A model-based method for computer-aided medical decision-making.Artificial Intelligence, 11(1–2):145–172, 1978

  18. [18]

    Abdelzaher

    Xinyi Liu, Dachun Sun, Yi Fung, Dilek Hakkani-Tür, and Tarek F. Abdelzaher. DocCHA: Towards LLM-augmented interactive online diagnosis system. InProceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDial), pages 609–619, 2025

  19. [19]

    Reasoning like a doctor: Improving medical dialogue systems via diagnostic reasoning process alignment

    Kaishuai Xu, Yi Cheng, Wenjun Hou, Qiaoyu Tan, and Wenjie Li. Reasoning like a doctor: Improving medical dialogue systems via diagnostic reasoning process alignment. InFindings of the Association for Computational Linguistics: ACL 2024, pages 6796–6814, 2024

  20. [20]

    Thomas Savage, John Wang, Robert Gallo, Abdessalem Boukil, Vishwesh Patel, Seyed Amir Safavi-Naini, Ali Soroush, and Jonathan H. Chen. Large language model uncertainty proxies: Discrimination and calibration for medical diagnosis and treatment.Journal of the American Medical Informatics Association, 32(1):139–149, 2025

  21. [21]

    Collins, David Reich, Robert Freeman, and Eyal Klang

    Mahmud Omar, Vera Sorin, Jeremy D. Collins, David Reich, Robert Freeman, and Eyal Klang. Multi-model assurance analysis: LLMs highly vulnerable to adversarial hallucination attacks during clinical decision support.Communications Medicine, 5(1):97, 2025

  22. [22]

    Omiye, Jenna C

    Jesutofunmi A. Omiye, Jenna C. Lester, Simon Spichak, Veronica Rotemberg, and Roxana Daneshjou. Large language models propagate race-based medicine.npj Digital Medicine, 6(1): 195, 2023

  23. [23]

    AfriMed-QA: A pan-African, multi- specialty, medical question-answering benchmark dataset.arXiv preprint arXiv:2411.15640,

    Tobi Olatunji, Charles Nimo, Abraham Owodunni, et al. AfriMed-QA: A pan-African, multi- specialty, medical question-answering benchmark dataset.arXiv preprint arXiv:2411.15640,

  24. [24]

    ACL 2025, Best Social Impact Award

  25. [25]

    Conversational disease diagnosis via external planner-controlled large language models.arXiv preprint arXiv:2404.04292, 2024

    Zhoujian Sun, Chenghua Luo, Liangzhi Jiang, Linlin Liu, Xiaohan Yang, Junfan Shi, Tangjie Lv, Benyou Zhang, and Kezhi Mao. Conversational disease diagnosis via external planner-controlled large language models.arXiv preprint arXiv:2404.04292, 2024

  26. [26]

    DDXPlus: A new dataset for automatic medical diagnosis

    Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. DDXPlus: A new dataset for automatic medical diagnosis. InAdvances in Neural Information Processing Systems, volume 35, 2022

  27. [27]

    Jeffrey.The Logic of Decision

    Richard C. Jeffrey.The Logic of Decision. McGraw-Hill, 1965. 13

  28. [28]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

  29. [29]

    MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

    Mohammad Almansoori, Komal Kumar, and Hisham Cholakkal. MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions . Inproceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2025, volume LNCS 15968. Springer Nature Switzerland, September 2025

  30. [30]

    AI hospital: Benchmarking large language models in a multi-agent medical interaction simulator

    Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Wang Siyuan, Zhongyu Wei, and Fei Huang. AI hospital: Benchmarking large language models in a multi-agent medical interaction simulator. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Compu-...

  31. [31]

    PatientSim: A persona-driven simulator for realistic doctor-patient interactions

    Daeun Kyung, Hyunseung Chung, Seongsu Bae, Jiho Kim, Jae Ho Sohn, Taerim Kim, Soo Kyung Kim, and Edward Choi. PatientSim: A persona-driven simulator for realistic doctor-patient interactions. InAdvances in Neural Information Processing Systems, volume 38, 2025

  32. [32]

    Automatic interactive evaluation for large language models with state aware patient simula- tor.ArXiv, abs/2403.08495, 2024

    Yusheng Liao, Yutong Meng, Yuhao Wang, Hongcheng Liu, Yanfeng Wang, and Yu Wang. Automatic interactive evaluation for large language models with state aware patient simula- tor.ArXiv, abs/2403.08495, 2024. URL https://api.semanticscholar.org/CorpusID: 268379575

  33. [33]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017

  34. [34]

    I think I had a fever

    Yifan Zhao, Yixiao Hua, Dan Roth, and Jinhao Chen. Probing the multi-turn planning capa- bilities of LLMs via 20 question games. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. 14 Appendix overview This supplementary material is organized as follows: A. Theoretical foundations. . . . . . . . . . . . ....

  35. [35]

    Classify the user response into one of the Allowed Values

  36. [36]

    Yes", "Not really

    Assess confidence: very_likely, likely, uncertain, unlikely, very_unlikely KEY RULES: – Direct answer ("Yes", "Not really"): map to closest value. – Unrelated response: return "unknown|likely". – Uncertain language ("I think so", "maybe"): use "uncertain". – Prefer "unknown" over hard negative when partial/vague. Return format: "value|confidence_level" Ex...

  37. [37]

    f_fever",

    NEVER use technical IDs (e.g., "f_fever", "d_flu")

  38. [38]

    Speak naturally and empathetically

  39. [39]

    Do NOT mention probabilities or internal values

  40. [40]

    {narrative}

    If clarifying a previous confusion, keep it brief. Bulk intake prompt.At session start, a single bulk intake call maps the patient’s opening narrative to multiple(f, v, c)triples simultaneously, reducing the number of follow-up questions needed. Bulk Intake Prompt System: You are an expert medical intake specialist. User Text: "{narrative}" TASK: Extract ...

  41. [41]

    Only extract explicitly mentioned or strongly implied features. 18

  42. [42]

    Extract demographics (age, gender, location) if present

  43. [43]

    Do NOT infer negatives from silence; omit unlisted features

  44. [44]

    feature_id

    Assess confidence for each extracted feature. Return JSON: {"feature_id": {"value": "...", "confidence": "likely"}, "demographics": {"age": N, ...}} Patient simulator prompt.The patient simulator receives the full clinical profile (demographics, chief complaint, symptoms, medical history, observed findings) and persona instructions. Crucially, the patient...

  45. [45]

    Answer based on KNOWN OBSERVED FINDINGS and patient profile

  46. [46]

    If asked about something listed: answer faithfully (including denials)

  47. [47]

    I’m not sure

    If NOT listed: say “I’m not sure” or “I don’t know”. Do not invent symptoms

  48. [48]

    NEVER reveal your diagnosis directly

  49. [49]

    Keep responses concise (1–3 sentences). PERSONA: – Language: {CEFR level A/B/C} – Personality: {plain|verbose|overanxious|distrustful} – Memory: {high|low recall} – Alertness: {normal|moderate daze|high daze} Standalone doctor prompt.The standalone LLM doctor receives no external reasoning support. It conducts the full diagnostic interview and outputs a d...

  50. [50]

    Acute exacerbation of COPD

    {prediction_2} ... REFERENCE DISEASE LIST: – {disease_name_1} – {disease_name_2} ... Output ONLY a numbered list with the matched disease name (exactly as written in the reference list) or NO_MATCH. LLM-generated KB prompts.The LLM-generated KB is constructed via two sequential prompts. Thefeature generation promptasks the model to propose clinically plau...

  51. [51]

    We then measure the posterior gap between the oracle’s top-1 and top-2 diseases

    KB Failure.We run anoracletest: all ground-truth features are supplied at confidence c=1.0. We then measure the posterior gap between the oracle’s top-1 and top-2 diseases. If this gap falls below a thresholdγ(γ=0.80), the KB cannot reliably discriminate the disease pair

  52. [52]

    Two subtypes: • False Positive (FP): the engine asks about a featureabsentfrom the patient’s ground-truth profile; the pipeline returns yes

    LLM Failure.The LLM pipeline (verbaliser + patient simulator + parser) injected incorrect evidence into the engine. Two subtypes: • False Positive (FP): the engine asks about a featureabsentfrom the patient’s ground-truth profile; the pipeline returns yes. If more than 2 such turns occur in a session, the case is flagged. The threshold reflects the empiri...

  53. [53]

    I have chest pain even at rest, upper chest pain, and pleuritic chest pain

    Inference Failure.The KB is adequate and the evidence pipeline introduced no detectable errors, yet the engine converged to the wrong diagnosis. Two subtypes: •Close: the ground truth remains in the top-3 posterior at session end, but the question budget or EIG policy did not resolve the differential. • Diverged: the ground truth is not in the top-3. The ...