arxiv: 2605.15574 · v1 · pith:IFG4PNDWnew · submitted 2026-05-15 · 💻 cs.CV

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Pith reviewed 2026-05-20 19:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords: longitudinal reasoning · multi-visit chest X-ray · vision-language models · medical VQA benchmark · temporal constraints · disease progression · interval change reasoning · global trajectory summarization

The pith

Fourteen vision-language models average only 29.3% accuracy on five-way multiple-choice questions over multi-visit chest X-ray timelines, against a 20% chance level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MI-CXR, a benchmark of five-way multiple-choice questions drawn from five-visit chest X-ray timelines. It defines three task families that test whether models can locate events in time, describe changes between intervals, and summarize overall disease trajectories. When fourteen current vision-language models are evaluated, they reach an average accuracy of 29.3 percent, only modestly above the 20 percent random baseline. Stage-wise analysis shows that models can produce locally reasonable descriptions of single intervals yet fail to enforce the correct order of events or assemble evidence into one consistent story across the full sequence.
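The arithmetic behind "only modestly above the 20 percent random baseline" can be sketched in a few lines. The 29.3% average and the five-way format come from the paper; the chance-corrected score is an illustrative normalization added here as an assumption, not a metric the authors report.

```python
def chance_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice question."""
    return 1.0 / num_options


def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so chance maps to 0 and a perfect score maps to 1.

    Illustrative normalization (an assumption), not a metric from the paper.
    """
    base = chance_baseline(num_options)
    return (accuracy - base) / (1.0 - base)


reported_accuracy = 0.293  # average over the 14 evaluated VLMs, as reported
baseline = chance_baseline(5)
margin_points = (reported_accuracy - baseline) * 100
corrected = chance_corrected(reported_accuracy, 5)

print(f"chance baseline: {baseline:.1%}")
print(f"margin over chance: {margin_points:.1f} points")
print(f"chance-corrected score: {corrected:.3f}")
```

Under this normalization the reported average sits at roughly 0.12 on a scale where 0 is guessing and 1 is perfect, which is one way to read "modestly above random".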

Core claim

MI-CXR instantiates clinically grounded visual reasoning over time through five-way multiple-choice questions on five-visit patient timelines and shows that state-of-the-art vision-language models average 29.3 percent accuracy while producing locally plausible interval descriptions that nevertheless violate temporal constraints and lack global consistency.

What carries the argument

The MI-CXR benchmark of five-way multiple-choice questions over five-visit CXR sequences, which probes for enforcement of temporal constraints and composition of evidence into globally consistent decisions.

If this is right

  • Models require explicit mechanisms to enforce temporal order when processing sequences of medical images.
  • Global consistency across an entire patient timeline must be checked separately from local interval accuracy.
  • Current vision-language models remain limited for reliable longitudinal monitoring of disease progression.
  • Benchmarks focused on multi-interval reasoning can expose gaps that single-image or short-pair tests miss.
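The distinction between local interval accuracy and global consistency in the list above can be illustrated with a minimal sketch. The change vocabulary (`new`, `resolved`, and so on) and the `find_violations` helper are hypothetical, introduced here for illustration; they are not the paper's label set or evaluation code.

```python
from typing import List

# Hypothetical change labels for one abnormality over consecutive intervals.
PRESENT_BEFORE = {"resolved", "improved", "worsened", "stable"}  # need a finding to act on
ABSENT_BEFORE = {"new", "absent"}                                # need no finding beforehand
PRESENT_AFTER = {"new", "improved", "worsened", "stable"}        # finding exists afterwards


def find_violations(initially_present: bool, changes: List[str]) -> List[str]:
    """Walk interval labels in order and report steps that contradict the prior state."""
    present = initially_present
    violations = []
    for i, change in enumerate(changes, start=1):
        if change not in PRESENT_BEFORE | ABSENT_BEFORE:
            raise ValueError(f"unknown change label: {change!r}")
        interval = f"T{i}->T{i + 1}"
        if change in PRESENT_BEFORE and not present:
            violations.append(f"{interval}: '{change}' requires the finding to be present beforehand")
        elif change in ABSENT_BEFORE and present:
            violations.append(f"{interval}: '{change}' requires the finding to be absent beforehand")
        # Trust the latest claim when updating the running state.
        present = change in PRESENT_AFTER
    return violations


# Each label is plausible on its own, but "improved" follows "resolved".
inconsistent = find_violations(False, ["new", "worsened", "resolved", "improved"])
consistent = find_violations(False, ["new", "worsened", "resolved", "absent"])
print(inconsistent)
print(consistent)
```

A check of this kind runs over the whole sequence, so a model can pass every single-interval judgment and still fail it.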

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures that explicitly reward temporal ordering and cross-interval consistency could close part of the observed gap.
  • The same evaluation approach could be applied to other sequential imaging modalities such as CT or MRI follow-ups.
  • In practice, low global consistency may cause models to miss slow disease trends that span multiple visits.

Load-bearing premise

That five-way multiple-choice questions over five visits, without free-form report generation or extra clinical context, accurately capture clinically grounded visual reasoning over time.

What would settle it

A future model that scores near perfect accuracy on the MI-CXR questions yet still produces temporally inconsistent or contradictory statements when asked to generate free-form longitudinal reports on real patient sequences.

Figures

Figures reproduced from arXiv: 2605.15574 by Jaeyoung Do, Sunghwan Steve Cho, Yunseok Han.

Figure 1: Overview of longitudinal medical visual question answering and MI-CXR. Clinical image interpretation requires integrating evidence across multiple patient visits (top). We formalize longitudinal medical VQA into three core reasoning capabilities, namely Temporal Event Localization (TEL), Interval-wise Change Reasoning (ICR), and Global Trajectory Summarization (GTS), and evaluate them over multi-visit CXR sequence…
Figure 2: Overview of MI-CXR construction. We repurpose structured metadata from MIMIC-Ext-CXR-QBA and chest X-ray images from MIMIC-CXR-JPG to construct patient-level longitudinal timelines with at least five visits. After fixing the longitudinal cohort, multiple question types are instantiated from the same timelines to evaluate complementary longitudinal reasoning capabilities, including temporal event localizati…
Figure 3: Performance under capability-aligned task decomposition. Models are evaluated using a stage-wise inference protocol that separates interval-level evidence articulation from final decision making. 5 Error Patterns in Longitudinal Grounding To better understand the limitations of current VLMs on longitudinal medical reasoning, we analyze error patterns across task categories. A central finding is that most…
Figure 4: Example of a patient-level temporally ordered study sequence constructed from scene graph annotations. Each study represents a single clinical visit and aggregates all associated observations and temporal change labels. provided in the metadata, which encodes the relative chronological position of each study for a given patient. This ordering is further validated using timestamp-related fields, includin…
Figure 5: Correct interval summary prompt. You generate incorrect but medically plausible interval-based summaries for a single abnormality. Rules: - Use ONLY semantic change-type flips. - Keep interval positions unchanged. - Keep laterality unchanged if present. - Do NOT match the correct summary. - One sentence, at most one semicolon
Figure 6: Incorrect interval summary prompt. You are a medical language assistant. You generate correct interval-based temporal summaries for multiple radiologic abnormalities independently. CRITICAL REQUIREMENTS: - You MUST generate exactly ONE summary for EVERY abnormality provided. - DO NOT omit any abnormality under any circumstance. - Even if an abnormality shows no change, remains stable, or is normal, you MUS…
Figure 7: Correct global summary prompt (Multi-Entity). You generate incorrect but medically plausible interval-based temporal summaries using semantic change-type flips only. CRITICAL REQUIREMENTS: - You MUST generate exactly ONE incorrect summary for EVERY abnormality requested. - Use ONLY semantic flips (e.g., increase/decrease, resolve/persistent). - Do NOT change the temporal order of intervals. - Keep laterali…
Figure 8: Incorrect global summary prompt. deterministically from expert-annotated presence transitions: both correct and incorrect options are fully specified by rules without free-form generation. Therefore, LLM-based verification would add little value for TEL and may introduce unnecessary noise (Zhang et al., 2026). Objective and Non-circularity The goal of verification is not to assess clinical correctness or…
Figure 9: Aggregated distribution of inter-study intervals across all five-study windows. While the median gap is on the order of one to two days, a substantial fraction of intervals spans several months or longer, indicating heterogeneous temporal horizons within the dataset. Interval-level Gaps Across all consecutive study pairs, the median inter-study gap ranges from 1.4 to 1.9 days, indicating that many follow…
Figure 10: Example of Temporal Event Localization (TEL) question with the Single Emergence (Q1), Single Resolution (Q2), Multi Emergence (Q3), and Multi Resolution (Q4)
Figure 11: Example of Temporal Event Localization (TEL) question with E→R (Q1) and R→E (Q2)
Figure 13: Example of Global Trajectory Summarization (GTS) question for a single abnormality
Figure 12: Example of Interval-wise Change Reasoning (ICR) question
Figure 14: Example of Global Trajectory Summarization (GTS) question for multiple abnormality
Figure 15: Example of Interval-wise Change Reasoning (ICR) variant. Each question presents a fixed five-visit timeline and asks the model to determine which statement correctly describes the visual change occurring within a given interval (e.g., T4 → T5). All questions in this variant focus on a single abnormality and assess only change-type interpretation, such as new appearance, resolution, or progression. In…
Figure 16: Prompt for ICR variant generation. All final answer correctness labels are assigned deterministically prior to LLM invocation. E Stage-wise Evaluation Protocol and Implementation Details This section describes the evaluation protocol and implementation details used to assess model performance across all tasks. The goal of this section is to clarify how models are evaluated in a consistent and reproducibl…
Figure 19: Failures in Temporal Event Localization - (E→R / R→E). F.2 Failure Mode in ICR Problem (ICR – Multiple Abnormalities): Which statement correctly describes the interval-level change? Answer Choices: A. Between T1 and T2, the pneumothorax has increased. B. Between T2 and T3, the pleural effusion has decreased. C. Between T3 and T4, lung opacity has worsened. D. Between T4 and T5, bibasilar atelectasis has …
Figure 18: Failures in Temporal Event Localization - Multiple (E/R). Problem (TEL – Multiple Emergence Candidates): Which pair of studies correctly captures the interval during which pleural effusion first appears subsequently resolves? Answer Choices: A. T2, T3 B. T1, T3 C. T3, T4 D. T1, T5 E. There is no sequential emergence to resolution Stage-1 Model Response (Interval-level Description): • T1–T2: A small pleura…
Figure 21: Failures in Global Trajectory Summarization – Single Abnormality. Problem (GTS – Multi Abnormality): Which statement correctly describes the interval-based trajectory of abnormalities across the study sequence? Target Abnormalities: Pleural effusion, pneumothorax, lung opacity, bony structures intact Answer Choices: A. A pneumothorax newly appears between T2 and T3 and resolves by T5, while pleural effus…
Figure 23: Failure within local-interval misinterpretation: ICR example. Problem (GTS – Multi Abnormality): Which statement correctly describes the interval-based trajectory of abnormalities across the study sequence? Answer Choices:
Figure 25: Distribution of task-aligned reasoning failure types across TEL, ICR, and GTS. Local interval misinterpretation is excluded to highlight higher-level temporal reasoning failures. G.1 Distribution by Task Family
Figure 24: Failure within local-interval misinterpretation: GTS example. G Error Type Distribution Across Tasks This section presents a quantitative analysis of how task-aligned reasoning failures are distributed across task families and model families. The analysis complements the qualitative examples in Appendix F by demonstrating that the observed error types arise systematically as a function of task struct…
Figure 26: Distribution of task-aligned reasoning failures across closed-source, open-source, and medical-specialized VLMs
Figure 28: H.5 Quantitative Results

Model            ROUGE-L  METEOR  CIDEr
GPT-5.2          0.307    0.354   0.358
InternVL3.5-38B  0.341    0.369   0.411
MedGemma-27B     0.326    0.340   0.393
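The excerpt under Figure 8 says TEL options are derived deterministically from expert-annotated presence transitions, and the Figure 18 example asks for the pair of studies where a finding first appears and then resolves. A minimal sketch of such a rule is given below, assuming a per-visit boolean presence list. The function names and the exact tie-breaking are hypothetical; the benchmark's actual rules are not given in this excerpt.

```python
from typing import List, Optional, Tuple


def first_emergence(presence: List[bool]) -> Optional[Tuple[int, int]]:
    """Return 1-based visit indices (i, i+1) of the first absent->present transition."""
    for i in range(len(presence) - 1):
        if not presence[i] and presence[i + 1]:
            return (i + 1, i + 2)
    return None


def emergence_then_resolution(presence: List[bool]) -> Optional[Tuple[int, int]]:
    """Return (visit where the finding first appears, first later visit where it is gone).

    Returns None when there is no emergence followed by a resolution.
    """
    emerged = first_emergence(presence)
    if emerged is None:
        return None
    appear_visit = emerged[1]
    # Visits after appear_visit start at 0-based index appear_visit.
    for j in range(appear_visit, len(presence)):
        if not presence[j]:
            return (appear_visit, j + 1)
    return None


# Hypothetical five-visit timeline for one abnormality (T1..T5).
effusion = [False, True, True, False, False]
print(first_emergence(effusion))            # (1, 2)
print(emergence_then_resolution(effusion))  # (2, 4)
```

Because the answer follows from the presence sequence alone, distractors can be produced by the same rule applied to shifted or swapped visits, which is consistent with the claim that no free-form generation is needed for this task family.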
Original abstract

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces MI-CXR, a benchmark for standardized evaluation of multi-interval longitudinal reasoning over five-visit chest X-ray timelines. It defines three task families (Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization) instantiated as five-way multiple-choice questions without free-form generation or extra clinical context. Evaluation of 14 VLMs reports 29.3% average accuracy (modestly above random), with stage-wise probing showing models produce locally plausible descriptions but fail to enforce temporal constraints or achieve global consistency.

Significance. If the questions are validated to require full-timeline integration, the work is significant for filling a gap in medical VQA benchmarks focused on single images or pairs. It provides a reproducible public resource (GitHub link) that exposes concrete limitations in current VLMs for clinically grounded temporal reasoning over disease evolution, supporting targeted progress in longitudinal medical AI.

major comments (1)
  1. [Benchmark Construction / Question Design] The manuscript provides no quantitative validation (e.g., radiologist ratings of question difficulty, ablation removing all but one interval, or shortcut-detection experiments) that the five-way MCQs over five-visit timelines cannot be solved from static visual features in a single image or adjacent pair. This is load-bearing for the central claim in the abstract and §4 that the 29.3% accuracy and probing results demonstrate a specific failure to 'enforce temporal constraints or compose evidence into globally consistent decisions'; without it, the performance gap may reflect local shortcuts rather than a longitudinal reasoning deficit.
minor comments (2)
  1. [Abstract] The abstract and results would benefit from an explicit statement of the random baseline (20% for five-way MCQ) alongside the 29.3% figure to better contextualize 'modestly above random guessing'.
  2. [Data Curation] Provide more detail on inter-annotator agreement and quality control during question creation to strengthen the claim of clinical groundedness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the major comment below and commit to revisions that incorporate additional validation experiments to strengthen the claims regarding longitudinal reasoning requirements.

Point-by-point responses
  1. Referee: [Benchmark Construction / Question Design] The manuscript provides no quantitative validation (e.g., radiologist ratings of question difficulty, ablation removing all but one interval, or shortcut-detection experiments) that the five-way MCQs over five-visit timelines cannot be solved from static visual features in a single image or adjacent pair. This is load-bearing for the central claim in the abstract and §4 that the 29.3% accuracy and probing results demonstrate a specific failure to 'enforce temporal constraints or compose evidence into globally consistent decisions'; without it, the performance gap may reflect local shortcuts rather than a longitudinal reasoning deficit.

    Authors: We acknowledge that the current manuscript does not include explicit quantitative validation such as radiologist ratings, single-interval ablations, or dedicated shortcut-detection experiments to empirically confirm that the questions cannot be solved without full-timeline integration. The task families were designed to require multi-interval reasoning (e.g., Temporal Event Localization specifies identifying the exact visit of an event among five, and Global Trajectory Summarization requires synthesizing the overall disease progression), and the stage-wise probing already indicates models generate locally plausible outputs yet fail at global consistency. Nevertheless, to directly address this concern and support the central claim, we will add shortcut-detection experiments (e.g., performance on single-image or adjacent-pair inputs) and radiologist validation of question difficulty in the revised manuscript. These results will be reported in a new subsection of §3 or §4. revision: yes
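    The shortcut-detection experiment discussed in this exchange (scoring the same questions on single-image or adjacent-pair inputs) could be organized along the lines of the sketch below. Everything here is hypothetical: the `Question` container, the selector names, and the toy predictor are illustrative stand-ins, not the authors' evaluation code, and strings stand in for CXR images.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Question:
    visits: List[str]  # stand-ins for the five CXR images, oldest first
    prompt: str
    answer: str        # gold option letter, "A" to "E"


Predictor = Callable[[Sequence[str], str], str]
Selector = Callable[[Sequence[str]], List[str]]

# Input conditions: the full timeline versus truncated views of it.
SELECTORS: Dict[str, Selector] = {
    "full_timeline": lambda v: list(v),
    "last_adjacent_pair": lambda v: list(v[-2:]),
    "last_image_only": lambda v: list(v[-1:]),
}


def accuracy(predict: Predictor, questions: List[Question], select: Selector) -> float:
    """Fraction of questions answered correctly when only selected visits are shown."""
    correct = sum(predict(select(q.visits), q.prompt) == q.answer for q in questions)
    return correct / len(questions)


def shortcut_report(predict: Predictor, questions: List[Question]) -> Dict[str, float]:
    """Accuracy per input condition; similar scores across conditions suggest a shortcut."""
    return {name: accuracy(predict, questions, sel) for name, sel in SELECTORS.items()}


def toy_predict(visits: Sequence[str], prompt: str) -> str:
    """Toy model that only ever looks at the most recent visit."""
    return "A" if visits[-1] == "clear" else "B"


toy_questions = [
    Question(["clear", "effusion", "effusion", "clear", "clear"], "Which option is correct?", "A"),
    Question(["clear", "clear", "effusion", "effusion", "effusion"], "Which option is correct?", "B"),
    Question(["effusion", "clear", "clear", "clear", "clear"], "Which option is correct?", "C"),
]
report = shortcut_report(toy_predict, toy_questions)
print(report)
```

    In this toy setup the three conditions score identically, which is the signature a shortcut would leave: removing earlier visits costs the model nothing. A benchmark that requires full-timeline integration should instead show accuracy falling toward chance as visits are removed.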

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

Full rationale

The paper introduces a new dataset and benchmark (MI-CXR) consisting of five-way MCQs over five-visit timelines across three task families, then reports direct empirical results from evaluating 14 VLMs (average 29.3% accuracy). No equations, fitted parameters, or derivations are present that reduce predictions to inputs by construction. The central claims rest on newly created questions and data rather than self-citation chains, ansatz smuggling, or renaming of known results. The evaluation is self-contained against external model performance and does not presuppose its own conclusions via definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the design choice that multiple-choice questions over fixed five-visit timelines can serve as a proxy for clinical longitudinal reasoning without needing free-form text or extra patient metadata.

axioms (1)
  • Domain assumption: Multiple-choice questions over five-visit patient timelines can assess clinically grounded visual reasoning over time without free-form report generation or additional clinical context.
    Explicitly stated in the abstract as the benchmark construction principle.

pith-pipeline@v0.9.0 · 5741 in / 1194 out tokens · 44406 ms · 2026-05-20T19:49:22.331430+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 6 internal anchors

  1. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 2018.
  2. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 2000.
  3. PathVQA: 30000+ Questions for Medical Visual Question Answering. arXiv preprint arXiv:2003.10286.
  4. MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression. 2025.
  5. TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models. 2025.
  6. Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation. 2025.
  7. CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays. 2025.
  8. Review of Temporal Reasoning in the Clinical Domain for Timeline Extraction: Where we are and where we need to be. Journal of Biomedical Informatics, 2021.
  9. Predicting treatment response from longitudinal images using multi-task deep learning. Nature Communications, 2021.
  10. The Need for Medical Artificial Intelligence That Incorporates Prior Images. Radiology, 2022.
  11. Longitudinal Image Data for Outcome Modeling. Clinical Oncology, 2025.
  12. PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation. 2025.
  13. HERGen: Elevating Radiology Report Generation with Longitudinal Data. 2024.
  14. Liu, Kang and Ma, Zhuoqi and Kang, Xiaolu and Li, Yunan and Xie, Kun and Jiao, Zhicheng and Miao, Qiguang. Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation. doi:10.1109/cvpr52734.2025.00968
  15. Insights into a radiology-specialised multimodal large language model with sparse autoencoders. 2025.
  16. MIMIC-Ext-CXR-QBA: A Structured, Tagged, and Localized Visual Question Answering Dataset with Question-Box-Answer Triplets and Scene Graphs for Chest X-ray Images. 2025.
  17. MIMIC-CXR-JPG: Chest Radiographs with Structured Labels. 2024.
  18. MIMIC-CXR: A Large Publicly Available Database of Labeled Chest Radiographs. arXiv preprint arXiv:1901.07042.
  19. Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. 2025.
  20. Zhang, Kai and Zhou, Rong and Adhikarla, Eashan and Yan, Zhiling and Liu, Yixin and Yu, Jun and Liu, Zhengliang and Chen, Xun and Davison, Brian D. and Ren, Hui and Huang, Jing and Chen, Chen and Zhou, Yuyin and Fu, Sunyang and Liu, Wei and Liu, Tianming and Li, Xiang and Chen, Yong and He, Lifang and Zou, James and Li, Quanzheng and Liu, Hongfang and Sun...
  21. MedGemma Technical Report. 2025.
  22. MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning. 2025.
  23. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. 2023.
  24. Lange, Mine Benedicte and Petersen, Lars J. and Lausen, Mads and Bruun, Niels Henrik and Nielsen, Michael Bachmann and Zacho, Helle D. Diagnostics, 2022.
  25. K. White and K. Berbaum and W. L. Smith. Investigative Radiology. doi:10.1097/00004424-199403000-00002
  26. Li Zhang and Xin Wen and Jian-Wei Li and Xu Jiang and Xian-Feng Yang and Meng Li. Insights into Imaging. doi:10.1186/s13244-023-01521-7
  27. SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering. 2021.
  28. PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. 2024.
  29. HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. 2024.
  30. Bae, Seongmin and Kyung, Donghyun and Ryu, Jaeho and Cho, Eunji and Lee, Gyeongmin and Kweon, Seungwoo and Oh, Jaehun and Ji, Limin and Chang, Eunsol and Kim, Taeyoung and Choi, Edward. 2024. doi:10.13026/deqx-d943
  31. EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images. 2023.
  32. CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats. 2024.
  33. Libra: Leveraging Temporal Images for Biomedical Radiology Analysis. 2025.
  34. Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays. 2024.
  35. Hu, Xinyue and Gu, Lin and An, Qiyuan and Zhang, Mengliang and Liu, Liangchen and Kobayashi, Kazuma and Harada, Tatsuya and Summers, Ronald M. and Zhu, Yingying. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023. doi:10.1145/3580305.3599819
  36. Describing and Localizing Multiple Changes with Transformers. 2021.
  37. ReXrank: A Public Leaderboard for AI-Powered Radiology Report Generation. 2024.
  38. Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions. 2025.
  39. Large Language Models Are Not Robust Multiple Choice Selectors. 2024.
  40. Anthropic. 2024.
  41. OpenAI. 2025.
  42. Google DeepMind. 2024.
  43. Qwen3 Technical Report. 2025.
  44. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
  45. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.
  46. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.
  47. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265.
  48. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding. 2024.
  49. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. 2023.
  50. What matters when building vision-language models? 2024.
  51. Evaluating Step-by-step Reasoning Traces: A Survey. 2025.
  52. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 2023.
  53. MRGAgents: A Multi-Agent Framework for Improved Medical Report Generation with Med-LVLMs. 2025.
  54. CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation. 2025.
  55. CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. 2020.
  56. Guo, Li and Tahir, Anas M. and Zhang, Dong and Wang, Z. Jane and Ward, Rabab K. Automatic Medical Report Generation: Methods and Applications. APSIPA Transactions on Signal and Information Processing. doi:10.1561/116.20240044
  57. Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records. 2025.
  58. Mingyu Zhang and Chenglong Xu and Yihong Gan and Yu Wang and Yi Fu and Yongqiang Chen. Automating construction contract question answering using large language model and fine-tuning. 2026. doi:10.1016/j.eswa.2025.129493
  59. BleedOrigin: Dynamic Bleeding Source Localization in Endoscopic Submucosal Dissection via Dual-Stage Detection and Tracking. 2025.
  60. Mann, Ritse M. The Lancet. doi:10.1016/S0140-6736(25)00093-5
  61. Hoang, Jenny K. Journal of the American College of Radiology. doi:10.1016/j.jacr.2015.10.017
  62. Holste, G. and Lin, M. and Zhou, R. and others. npj Digital Medicine. doi:10.1038/s41746-024-01207-4
  63. J.E. Longitudinal Image Data for Outcome Modeling. 2025. doi:10.1016/j.clon.2024.06.053
  64. Kojima, Takeshi and Gu, Shixiang (Shane) and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke. Large Language Models are Zero-Shot Reasoners.
  65. Vedantam, Ramakrishna and Lawrence Zitnick, C. and Parikh, Devi. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  66. Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004.
  67. Banerjee, Satanjeev and Lavie, Alon. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005.