pith. machine review for the scientific record.

arxiv: 2604.23605 · v1 · submitted 2026-04-26 · 💻 cs.AI

Recognition: unknown

Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords clinical diagnosis · large language models · electronic health records · adversarial debate · tree of thoughts · hallucination mitigation · diagnostic reasoning

The pith

DxChain improves clinical diagnosis by profiling patients first, planning with tree search, and debating evidence through opposing AI views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DxChain to fix tunnel vision and hallucinations that occur when large language models analyze unstructured electronic health records for diagnosis. It models the process after a clinician's thinking in three phases: anchoring on a full patient memory baseline, navigating possible paths strategically, and verifying through debate. The three new pieces are an initial broad profiling step to avoid early errors, a medical tree-of-thoughts method for efficient planning, and an angel-devil adversarial setup to handle conflicting evidence. If these hold, they would produce more accurate and logically consistent outputs than standard LLM approaches on real patient data.
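The three-phase loop described above can be wired together as a minimal sketch. Everything here is illustrative: the function names, data shapes, and string stand-ins for LLM calls are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class DiagnosticState:
    profile: str       # panoramic patient baseline (Memory Anchoring)
    hypotheses: list   # candidate diagnoses under consideration
    verified: list     # hypotheses that survived verification

def profile_then_plan(raw_ehr: str) -> str:
    """Phase I: summarize the full record into a baseline BEFORE any
    diagnosis is proposed, so the model does not anchor on the first
    finding it sees (the paper's cold-start mitigation)."""
    return f"baseline({raw_ehr[:24]}...)"

def navigate(profile: str) -> list:
    """Phase II: expand candidate diagnoses from the baseline.
    A real system would call an LLM here; this returns a stub."""
    return [f"dx_from:{profile[:8]}"]

def verify(hypotheses: list) -> list:
    """Phase III: keep only hypotheses that survive adversarial
    debate. The stub keeps every non-empty candidate."""
    return [h for h in hypotheses if h]

def dxchain(raw_ehr: str) -> DiagnosticState:
    profile = profile_then_plan(raw_ehr)
    hypotheses = navigate(profile)
    return DiagnosticState(profile, hypotheses, verify(hypotheses))

state = dxchain("65M, chest pain, troponin negative, ECG normal")
# state.verified now holds the candidates that passed the debate stub
```

The point of the sketch is the ordering constraint: profiling runs to completion before any hypothesis is generated, which is the structural difference from a plain chain-of-thought prompt.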

Core claim

DxChain transforms the diagnostic workflow into an iterative process mirroring a clinician's cognitive trajectory, which consists of Memory Anchoring, Navigation, and Verification phases. It introduces a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, a Medical Tree-of-Thoughts algorithm for strategic look-ahead planning and resource-aware navigation, and a Dialectical Diagnostic Verification procedure using Angel-Devil adversarial debates to resolve complex evidence conflicts. Evaluated on two real-world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, it achieves state-of-the-art performance in both diagnostic accuracy and logical consistency.

What carries the argument

A chain-based framework that combines Profile-Then-Plan for baseline creation, Medical Tree-of-Thoughts for navigation, and Angel-Devil adversarial debates for verification.

If this is right

  • DxChain mitigates cold-start hallucinations through initial panoramic patient profiling.
  • Med-ToT enables strategic and resource-aware navigation during diagnostic reasoning.
  • Angel-Devil debates help resolve complex evidence conflicts in patient records.
  • The overall system delivers higher diagnostic accuracy and logical consistency than prior LLM methods on the tested cardiac and CDM benchmarks.
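The second bullet, strategic and resource-aware navigation, is the part most amenable to a concrete sketch: a best-first search over diagnostic reasoning paths with a hard cap on expansions. The scoring function and expansion rule below are toy stand-ins for LLM calls, and nothing here is the authors' actual Med-ToT algorithm.

```python
import heapq

def med_tot(root, expand, score, budget=8):
    """Best-first look-ahead over reasoning paths, capped at `budget`
    node expansions (the resource-aware part). Returns the highest
    scoring path found within the budget."""
    frontier = [(-score(root), root)]
    best_score, best = score(root), root
    for _ in range(budget):
        if not frontier:
            break
        _, node = heapq.heappop(frontier)  # most promising path first
        for child in expand(node):
            s = score(child)
            if s > best_score:
                best_score, best = s, child
            heapq.heappush(frontier, (-s, child))
    return best

# Toy domain: a path is a tuple of workup steps; the score simply
# rewards longer evidence chains, and expansion stops at depth 3.
def expand(path):
    if len(path) >= 3:
        return []
    return [path + (step,) for step in ("order_test", "check_history")]

def score(path):
    return len(path)

best = med_tot((), expand, score, budget=5)
```

With a budget of 5 expansions the search still reaches a depth-3 path, which is the behavior the paper's "look-ahead under a resource cap" framing implies: promising branches are explored first, so the cap bounds cost without forcing breadth-first exhaustion.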

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This modular structure could allow selective swapping of components for different clinical specialties.
  • The approach might extend to other domains that involve reasoning over long, messy text records.
  • Testing in live hospital workflows would show whether the consistency gains translate to fewer missed diagnoses.

Load-bearing premise

That the Profile-Then-Plan, Med-ToT, and Angel-Devil components will reliably mitigate tunnel vision and hallucinations in LLMs on unstructured EHRs without introducing new inconsistencies or biases in real clinical use.

What would settle it

A direct comparison on held-out complex EHR cases: if DxChain shows no gain in accuracy or logical consistency over a plain LLM baseline on identical splits, the central claim fails.

Figures

Figures reproduced from arXiv: 2604.23605 by Duofan Tu, Heqin Zhu, Jun Li, Mingyue Zhao, Shaohua Kevin Zhou, Wenliang Li, Zhiqi Lv.

Figure 1: Overview of the proposed clinical reasoning framework. Phase I anchors memory from raw EHR via…
Figure 2: The workflow of the Dialectical Verification…
Figure 3: Stability analysis of DxChain across differ…
Original abstract

The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations present in their processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain-based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician's cognitive trajectory that consists of "Memory Anchoring", "Navigation" and "Verification" phases. DxChain introduces three key methodological innovations to elicit the potential of LLM: (i) a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree-of-Thoughts (Med-ToT) algorithm for strategic look ahead planning and resource aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing "Angel-Devil" adversarial debates to resolve complex evidence conflicts. Evaluated on two real world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, DxChain achieves state-of-the-art performances in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next-generation clinical AI. The code is at https://anonymous.4open.science/r/Dx-Chain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DxChain, a chain-based clinical reasoning framework for LLMs processing unstructured EHRs. It models clinician cognition via Memory Anchoring, Navigation, and Verification phases, with three innovations: Profile-Then-Plan to establish panoramic patient baselines and reduce cold-start hallucinations, Medical Tree-of-Thoughts (Med-ToT) for lookahead planning and resource-aware navigation, and Angel-Devil adversarial debates for dialectical verification of evidence conflicts. The framework is evaluated on two author-defined extensions of MIMIC-IV (Cardiac Disease and CDM), claiming state-of-the-art diagnostic accuracy and logical consistency.

Significance. If the empirical superiority claims hold under transparent, reproducible conditions with standard public benchmarks and full baseline re-evaluations, the work could offer a useful modular architecture for mitigating LLM limitations like tunnel vision in clinical settings. The provision of code supports reproducibility and further testing.

major comments (2)
  1. [Abstract and Evaluation section] The central SOTA claim depends on performance over MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM. These are non-standard author extensions; the manuscript must specify patient selection criteria, how unstructured notes were labeled for diagnoses, exact train/test partitioning, and confirm all baselines were re-run on identical splits (see Abstract and Evaluation section). Without this, numerical superiority cannot be taken as load-bearing evidence of general improvement on real-world EHRs.
  2. [Abstract] The abstract asserts SOTA in diagnostic accuracy and logical consistency but provides no concrete metrics (e.g., top-1 accuracy, F1, consistency score), baseline methods, statistical tests, error bars, or data-split details. This omission prevents assessment of whether the three components (Profile-Then-Plan, Med-ToT, Angel-Devil) deliver the claimed mitigation of hallucinations without introducing new biases.
minor comments (2)
  1. [Abstract] The code repository link is anonymous; a permanent, non-anonymous link or DOI should be provided for publication.
  2. [Method section] Ensure algorithmic pseudocode or precise step-by-step descriptions distinguish Med-ToT and Angel-Devil from prior Tree-of-Thoughts and multi-agent debate methods, to preempt claims of overlap with existing work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of reproducibility and clarity in our evaluation. We address each major comment below with specific plans for revision where appropriate, while defending the core contributions of DxChain on the basis of the experiments reported in the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central SOTA claim depends on performance over MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM. These are non-standard author extensions; the manuscript must specify patient selection criteria, how unstructured notes were labeled for diagnoses, exact train/test partitioning, and confirm all baselines were re-run on identical splits (see Abstract and Evaluation section). Without this, numerical superiority cannot be taken as load-bearing evidence of general improvement on real-world EHRs.

    Authors: We agree that explicit documentation of the benchmark construction is necessary to support the SOTA claims. The current manuscript introduces MIMIC-IV-Ext as author-defined extensions focused on cardiac disease and CDM cases with unstructured notes, but we will add a new subsection in the Evaluation section that details: patient selection via ICD-9/10 codes and note-length filters from MIMIC-IV; labeling of ground-truth diagnoses through structured data cross-validation supplemented by clinician review on a held-out sample; the precise temporal train/test split (ensuring no future leakage); and confirmation that all baselines (including standard CoT, ToT, and other LLM agents) were fully re-implemented and evaluated on the identical partitions. These additions will make the numerical improvements directly interpretable as evidence of reduced hallucinations and improved consistency. revision: yes

  2. Referee: [Abstract] The abstract asserts SOTA in diagnostic accuracy and logical consistency but provides no concrete metrics (e.g., top-1 accuracy, F1, consistency score), baseline methods, statistical tests, error bars, or data-split details. This omission prevents assessment of whether the three components (Profile-Then-Plan, Med-ToT, Angel-Devil) deliver the claimed mitigation of hallucinations without introducing new biases.

    Authors: We acknowledge that the abstract, constrained by length, currently states the SOTA outcome at a high level without numbers. The full paper reports concrete results in tables (top-1 accuracy, F1, and a custom logical consistency score) with comparisons to baselines and ablation studies isolating each component's contribution to hallucination reduction. To improve accessibility, we will revise the abstract to include key quantitative highlights (e.g., accuracy and consistency gains with significance markers) while retaining the component descriptions. Full error bars, p-values, and split details will remain in the Evaluation section and supplementary material, as they exceed typical abstract limits. This change preserves the abstract's role as a summary without misrepresenting the empirical support for the three innovations. revision: partial

Circularity Check

0 steps flagged

No circularity in DxChain prompting framework

full rationale

The paper describes a modular prompting architecture (Profile-Then-Plan, Med-ToT, Angel-Devil debates) that mirrors clinician phases without any equations, fitted parameters, or derivation chain. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on standard LLM capabilities applied to author-extended benchmarks; these benchmarks are external to the method itself and do not force the framework's internal logic by construction. The derivation is therefore self-contained as an independent procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the unproven effectiveness of the three new prompting procedures in real EHR data; no free parameters are introduced, but the framework assumes LLMs can execute the phases reliably.

axioms (1)
  • domain assumption: LLMs can follow multi-phase iterative reasoning instructions to reduce hallucinations when given structured prompts.
    Invoked throughout the description of the Memory Anchoring, Navigation, and Verification phases.
invented entities (3)
  • Profile-Then-Plan paradigm · no independent evidence
    purpose: establish a panoramic patient baseline to mitigate cold-start hallucinations
    New methodological component introduced to address tunnel vision.
  • Medical Tree-of-Thoughts (Med-ToT) · no independent evidence
    purpose: strategic look-ahead planning and resource-aware navigation
    Adaptation of tree-of-thoughts for medical diagnosis.
  • Angel-Devil adversarial debates · no independent evidence
    purpose: resolve complex evidence conflicts via dialectical verification
    New verification procedure using opposing LLM roles.
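The third entity, opposing LLM roles with a judge, can be sketched as a small rule-based stand-in. The Keep/Discard verdict labels, the finding schema, and the rule-based agents are all illustrative assumptions; a real system would back each role with an LLM call.

```python
def angel(dx, findings):
    """Advocate: cite findings that support the diagnosis."""
    return [f for f in findings if f.get("supports") == dx]

def devil(dx, findings):
    """Skeptic: cite findings that contradict the diagnosis."""
    return [f for f in findings if f.get("contradicts") == dx]

def judge(dx, findings):
    """Weigh both sides and emit a verdict per candidate diagnosis.
    Verdict labels here are assumed, not taken from the paper's spec."""
    pro, con = angel(dx, findings), devil(dx, findings)
    if pro and not con:
        return "Keep"
    if con and not pro:
        return "Discard"
    return "Modify" if pro and con else "Discard"

findings = [
    {"text": "elevated BNP", "supports": "heart failure"},
    {"text": "troponin negative", "contradicts": "myocardial infarction"},
]
verdicts = {dx: judge(dx, findings)
            for dx in ("heart failure", "myocardial infarction")}
```

The design point the sketch isolates: because the devil must ground its objection in a concrete finding, a diagnosis is never discarded by unsupported skepticism alone, which is how an adversarial round can add robustness without simply increasing rejection rates.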

pith-pipeline@v0.9.0 · 5543 in / 1408 out tokens · 59899 ms · 2026-05-08T06:09:17.160803+00:00 · methodology

discussion (0)

