Pith · machine review for the scientific record

arXiv: 2605.09272 · v1 · submitted 2026-05-10 · 💻 cs.AI · cs.CL · cs.CV

Recognition: 2 Lean theorem links

Towards Conversational Medical AI with Eyes, Ears and a Voice

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:50 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV
keywords conversational AI · medical AI · telemedicine · multimodal AI · clinical decision making · audio-visual processing · AI co-clinician · simulated consultations

The pith

An AI co-clinician processes live audio and video from patient conversations to make real-time clinical decisions and approaches the performance of primary care physicians on key tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a multimodal AI system that takes continuous audio-visual input from live consultations to support diagnosis and management in real time. It tests the system through 20 standardized telemedicine scenarios judged against physicians and other AI models using TelePACES criteria and case rubrics. The results indicate the AI comes close to physicians in areas such as management plans and differential diagnosis while clearly beating text-only models. If this holds, it implies that high-stakes medical AI works best when paired with human doctors rather than acting alone, and that text-only systems miss essential non-verbal information.

Core claim

The AI co-clinician, built with a dual-agent architecture on Gemini's low-latency audio-visual processing, approaches primary care physicians on TelePACES dimensions including management plans and differential diagnosis, and significantly outperforms GPT-Realtime on all general criteria. It reaches parity with physicians on case-specific triage measures, though physicians remain superior overall in case-specific assessments. This pattern suggests that text-only approaches miss the core challenges of medical consultation and that real-time diagnostic AI advances most safely in collaborative triadic models.

What carries the argument

Dual-agent architecture that balances deep clinical reasoning against the low latency needed for natural dialogue while ingesting continuous audio-visual streams.
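The paper does not publish implementation details beyond this description. One plausible shape for such a fast/slow split is a non-blocking hand-off from a low-latency dialogue loop to a background reasoner; all class and method names below are hypothetical, a sketch rather than the authors' design:

```python
import queue
import threading

class ReasoningAgent:
    """Slow path: deliberate reasoning over the accumulated context (hypothetical)."""
    def analyze(self, transcript):
        # Placeholder for deep multimodal reasoning; here we only flag keyword turns.
        findings = [turn for turn in transcript if "pain" in turn.lower()]
        return {"salient_turns": findings}

class DialogueAgent:
    """Fast path: keeps the conversation flowing with low latency (hypothetical)."""
    def __init__(self):
        self.latest_analysis = {}
    def respond(self, utterance):
        # Reply immediately, folding in whatever the slow agent has produced so far.
        n = len(self.latest_analysis.get("salient_turns", []))
        return f"Acknowledged: {utterance!r} (salient findings so far: {n})"

class CoClinician:
    """Couples the two agents: the dialogue agent never blocks on the reasoner."""
    def __init__(self):
        self.dialogue = DialogueAgent()
        self.reasoner = ReasoningAgent()
        self.transcript = []
        self._jobs = queue.Queue()
        threading.Thread(target=self._reason_loop, daemon=True).start()
    def _reason_loop(self):
        while True:
            snapshot = self._jobs.get()
            self.dialogue.latest_analysis = self.reasoner.analyze(snapshot)
            self._jobs.task_done()
    def handle(self, utterance):
        self.transcript.append(utterance)
        self._jobs.put(list(self.transcript))   # hand context to the slow path
        return self.dialogue.respond(utterance)  # reply without waiting
```

The design choice the sketch illustrates is that latency and depth are decoupled: the dialogue agent answers from the most recent completed analysis rather than waiting for a fresh one.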

If this is right

  • Text-only AI approaches fail to capture the true challenges of medical consultation.
  • High-stakes real-time diagnostic AI is most safely advanced in collaborative triadic models with doctors and patients.
  • Multimodal systems can inform decisions using auditory and visual cues during telemedicine visits.
  • Gaps remain in physical examination and disease-specific reasoning even for advanced multimodal agents.
  • Video-based simulation with custom rubrics can serve as a benchmark for conversational medical AI.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such systems could support initial assessments in remote or resource-limited settings.
  • Adding richer sensory data streams might close remaining gaps in physical exam interpretation.
  • Collaborative AI use could lower routine workload for physicians in outpatient care.
  • Broader testing across varied patient populations would clarify how well the approach generalizes.

Load-bearing premise

Standardized outpatient scenarios acted by resident physicians in a video interface accurately represent real patient interactions, and the TelePACES criteria plus case-specific rubrics validly measure clinical competence, especially for physical examination and disease-specific reasoning.

What would settle it

Performance comparison of the AI against physicians during unscripted, in-person encounters that require hands-on physical examination and individualized disease reasoning.

Original abstract

The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AI co-clinician, a dual-agent multimodal system built on Gemini that ingests continuous audio-visual streams from live patient conversations to support real-time clinical reasoning and dialogue. It evaluates the system via a randomized, interface-blinded crossover simulation (n=120 encounters) in which 10 internal-medicine residents acting as patients performed 20 standardized outpatient scenarios through a video interface; performance is compared against primary care physicians (PCPs), GPT-Realtime, and a baseline agent using newly defined TelePACES criteria plus case-specific rubrics. The central empirical finding is that the AI approaches PCPs on management plans and differential diagnosis, significantly outperforms GPT-Realtime on all general criteria, achieves parity on some triage measures, yet remains inferior to physicians on overall case-specific assessments, with acknowledged gaps in physical examination and disease-specific reasoning.

Significance. If the simulation results generalize, the work supplies direct evidence that continuous multimodal (audio-visual) input confers measurable advantages over text-only conversational agents in medical consultation tasks. The randomized blinded crossover design with explicit external baselines (PCPs and GPT-Realtime) is a methodological strength, as is the introduction of TelePACES criteria and the explicit framing of AI as a collaborative co-clinician rather than a replacement. These elements could inform future triadic human-AI clinical workflows and provide a reproducible template for evaluating real-time diagnostic AI.
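As one concrete reading of the crossover design (not the paper's actual protocol), a balanced assignment of 10 actors across 20 scenarios and four arms could be generated as follows; the per-actor encounter count and all identifiers are assumptions chosen only to match n = 120:

```python
import itertools
import random

def crossover_schedule(actors, scenarios, conditions, encounters_per_actor, seed=42):
    """Give each actor a personal sequence of (scenario, condition) encounters,
    balancing conditions within each actor, as in a crossover design."""
    rng = random.Random(seed)
    schedule = {}
    for actor in actors:
        chosen = rng.sample(scenarios, encounters_per_actor)  # unique scenarios per actor
        # Cycle the arms so each actor sees every condition equally often, then shuffle order.
        arms = list(itertools.islice(itertools.cycle(conditions), encounters_per_actor))
        rng.shuffle(arms)
        schedule[actor] = list(zip(chosen, arms))
    return schedule

sched = crossover_schedule(
    actors=[f"resident_{i}" for i in range(1, 11)],
    scenarios=list(range(1, 21)),
    conditions=["AI co-clinician", "PCP", "GPT-Realtime", "baseline"],
    encounters_per_actor=12,  # 10 actors x 12 encounters = 120
)
```

Interface blinding is a property of the rating workflow rather than the schedule, so it is not modeled here.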

major comments (2)
  1. [Study Design / Evaluation] Study Design / Evaluation section: The headline claim that AI co-clinician approaches PCPs on TelePACES management plans and differential diagnosis rests on the 20 scripted outpatient scenarios performed by resident actors in a controlled video interface being representative of live encounters. Because actors follow predetermined scripts and lack genuine pathology, the auditory and visual streams are less noisy and more predictable than real patient data; this may inflate the apparent benefit of continuous multimodal input and make the observed parity simulation-specific. The manuscript notes gaps in physical examination but does not quantify how the controlled cues affect differential-diagnosis or management scores.
  2. [Results] Results section: The abstract states that the agent is 'significantly outperforming GPT-Realtime across all general criteria' and shows 'parity with PCPs in case-specific triage measures,' yet the summary provides no statistical details, error bars, p-values, confidence intervals, or effect sizes. Full reporting of the statistical analysis (including any post-hoc scenario selection or multiple-comparison adjustments) is required to substantiate these quantitative claims.
minor comments (2)
  1. [Abstract] Abstract: The sample size (n=120 encounters) and number of actors (10) should be stated explicitly for immediate clarity.
  2. [Methods] Methods: Provide additional detail on how the TelePACES criteria were derived and validated, and on the precise scoring rubrics used for the case-specific assessments.
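To make the referee's statistical request concrete, the kind of paired-comparison reporting at issue can be sketched in standard-library Python; the scores below and the `paired_effect` helper are illustrative assumptions, not the paper's data or analysis code:

```python
import random
import statistics

def paired_effect(scores_a, scores_b, n_boot=10_000, seed=0):
    """Mean paired difference, Cohen's d_z for paired designs, and a 95%
    percentile-bootstrap confidence interval on the mean difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    d = mean_diff / statistics.stdev(diffs)  # standardized paired effect size
    rng = random.Random(seed)
    boots = sorted(
        statistics.fmean([rng.choice(diffs) for _ in diffs]) for _ in range(n_boot)
    )
    ci = (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1])
    return mean_diff, d, ci

# Illustrative rubric scores only -- NOT the paper's data.
ai  = [7.1, 6.8, 7.4, 6.9, 7.2, 7.0, 6.7, 7.3]
gpt = [6.2, 6.0, 6.5, 6.1, 6.4, 6.3, 5.9, 6.6]
mean_diff, d, ci = paired_effect(ai, gpt)
```

Reporting the triple (mean difference, effect size, CI) per criterion, plus any multiple-comparison adjustment, is the level of detail the referee asks the abstract and Results to carry.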

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important considerations for the generalizability and statistical transparency of our work. We address each major comment below and indicate the revisions we will undertake.

read point-by-point responses
  1. Referee: [Study Design / Evaluation] Study Design / Evaluation section: The headline claim that AI co-clinician approaches PCPs on TelePACES management plans and differential diagnosis rests on the 20 scripted outpatient scenarios performed by resident actors in a controlled video interface being representative of live encounters. Because actors follow predetermined scripts and lack genuine pathology, the auditory and visual streams are less noisy and more predictable than real patient data; this may inflate the apparent benefit of continuous multimodal input and make the observed parity simulation-specific. The manuscript notes gaps in physical examination but does not quantify how the controlled cues affect differential-diagnosis or management scores.

    Authors: We agree that the use of scripted scenarios performed by resident actors in a controlled video interface limits direct generalizability to real-world encounters, where auditory and visual data are noisier and less predictable. This design choice enabled a reproducible, randomized, interface-blinded crossover evaluation with standardized cases across AI systems and physicians. The manuscript already notes limitations in physical examination and disease-specific reasoning. We will revise the Discussion and Limitations sections to more explicitly address the potential for inflated performance due to reduced noise and to emphasize the simulation-specific nature of the parity findings on management plans and differentials. revision: partial

  2. Referee: [Results] Results section: The abstract states that the agent is 'significantly outperforming GPT-Realtime across all general criteria' and shows 'parity with PCPs in case-specific triage measures,' yet the summary provides no statistical details, error bars, p-values, confidence intervals, or effect sizes. Full reporting of the statistical analysis (including any post-hoc scenario selection or multiple-comparison adjustments) is required to substantiate these quantitative claims.

    Authors: The full Results section of the manuscript contains the complete statistical analyses, including p-values, confidence intervals, effect sizes, and details on any multiple-comparison adjustments. The abstract was intentionally concise and omitted these specifics. We will revise the abstract to incorporate key statistical details supporting the claims of significant outperformance over GPT-Realtime and parity on triage measures, ensuring the abstract is self-contained. revision: yes

standing simulated objections not resolved
  • Quantifying the precise impact of reduced noise and predictability from scripted actor scenarios (versus genuine patient pathology) on differential-diagnosis and management scores, as this would require new experiments with real clinical data outside the scope of the current simulation study.

Circularity Check

0 steps flagged

No circularity: results are empirical comparisons to external baselines

full rationale

The paper reports an empirical simulation study (n=120 encounters) comparing the AI co-clinician against PCPs, GPT-Realtime, and a baseline agent using predefined TelePACES criteria and case-specific rubrics on 20 standardized scenarios. No equations, fitted parameters, or derivations are presented whose outputs reduce to the inputs by construction. Performance claims rest on blinded human ratings of the encounters rather than self-referential modeling or self-citation chains. The assigned score of 1.0 is consistent with this; the noted limitations concern external validity of the actor-based simulation, not internal circularity of the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical outcomes of the simulation study and the assumed real-time multimodal capabilities of the underlying Gemini model; no new free parameters, ad-hoc axioms, or invented entities are introduced beyond standard assumptions about the base model.

axioms (1)
  • domain assumption Gemini provides low-latency voice and video processing suitable for continuous conversational use
    Invoked to justify the real-time dual-agent design in the system description.

pith-pipeline@v0.9.0 · 5844 in / 1515 out tokens · 65565 ms · 2026-05-12T04:50:29.149057+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  1. M. Asadi, J. W. O'Sullivan, F. Cao, T. Nedaee, K. Fardi, F.-F. Li, E. Adeli, and E. Ashley. Mirage: the illusion of visual understanding. arXiv preprint arXiv:2603.21687, 2026.
  2. S. Choi, M. R. U. Z. Sajib, J. Manzano, and C. J. Chlebek. mHealth technology experiences of middle-aged and older individuals with visual impairments: cross-sectional interview study. JMIR Form. Res., 7:e52410.
  3. G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
  4. S. G. Engström, M. André, E. Arvidsson, C. J. Östgren, M. Troein, and L. Borgquist. Personal GP continuity improves healthcare outcomes in primary care populations: a systematic review. British Journal of General Practice.
  5. Project Astra. https://deepmind.google/technologies/gemini/project-astra/ (accessed 2025-01-20); S. Johri, J. Jeong, B. A. Tran, D. I. Schlessinger, S. Wongvibulsin, L. A. Barnes, H.-Y. Zhou, Z. R. Cai, E. M. Van Allen, D. Kim, et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine, 31(1):77–86.
  6. R. Kruis, E. A. Brown, J. Johnson, K. N. Simpson, J. McElligott, and J. Harvey. Patient perceptions of audio-only versus video telehealth visits: a qualitative study among patients in an academic medical center setting. Telemed Rep, 5(1):89–98.
  7. E. C. Lee, V. Grigorescu, I. Enogieru, S. R. Smith, L. W. Samson, A. B. Conmy, and N. De Lew. Updated national survey trends in telehealth utilization and modality: 2021–2022. Technical Report HP-2023-09, Office of the Assistant Secretary for Planning and Evaluation, U.S. Department of Health and Human Services, Washington, D.C.
  8. A. A. Moore, J. R. Ellis, N. Dellavalle, M. Akerson, M. Andazola, E. G. Campbell, and M. DeCamp. Patient-facing chatbots: enhancing healthcare accessibility while navigating digital literacy challenges. doi: 10.1038/s41586-025-08869-4.
  9. Introducing GPT-Realtime. https://openai.com/index/introducing-gpt-realtime/ (accessed 2026-03-04); A. Pal, L. K. Umapathi, and M. Sankarasubbu. Med-HALT: medical domain hallucination test for large language models. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 314–334.
  10. D. J. Pereira Gray, K. Sidaway-Lee, E. White, A. Thorne, and P. H. Evans. Continuity of care with doctors: a matter of life and death? A systematic review of continuity of care and mortality. BMJ Open, 8(6):e021161.
  11. D. J. Sartori, R. W. Hayes, M. Horlick, J. G. Adams, and S. R. Zabar. The TeleHealth OSCE: preparing trainees to use telemedicine as a tool for transitions of care. J. Grad. Med. Educ., 12(6):764–768.
  12. P. Tammes, R. Morris, M. Murphy, and C. Salisbury. Is continuity of primary care declining in England? Practice-level longitudinal study, 2012–2017. British Journal of General Practice.
  13. Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S… doi: 10.1038/s41586-025-08866-7; S. Wamala Andersson and M. P. Gonzalez. Digital health literacy: a key factor in realizing the value of digital transformation in healthcare. Frontiers in Digital Health, 7:1461342.
  14. D. Zeltzer, Z. Kugler, L. Hayat, T. Brufman, R. Ilan Ber, K. Leibovich, T. Beer, I. Frank, R. Shaul, C. Goldzweig, et al. Comparison of initial artificial intelligence (AI) and final physician recommendations in AI-assisted virtual urgent care visits. Annals of Internal Medicine, 178(4):498–506.