MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Jaeyoung Do; Sunghwan Steve Cho; Yunseok Han

arxiv: 2605.15574 · v1 · pith:IFG4PNDWnew · submitted 2026-05-15 · 💻 cs.CV

Pith reviewed 2026-05-20 19:49 UTC · model grok-4.3

keywords: longitudinal reasoning · multi-visit chest X-ray · vision-language models · medical VQA benchmark · temporal constraints · disease progression · interval change reasoning · global trajectory summarization

The pith

Vision-language models achieve only 29.3% accuracy on multi-visit chest X-ray reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MI-CXR, a benchmark of five-way multiple-choice questions drawn from five-visit chest X-ray timelines. It defines three task families that test whether models can locate events in time, describe changes between intervals, and summarize overall disease trajectories. When fourteen current vision-language models are evaluated, they reach an average accuracy of 29.3 percent, only modestly above the 20 percent random baseline. Stage-wise analysis shows that models can produce locally reasonable descriptions of single intervals yet fail to enforce the correct order of events or assemble evidence into one consistent story across the full sequence.
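To make "only modestly above the 20 percent random baseline" concrete, the gap can be rescaled so that chance maps to 0 and a perfect score maps to 1. The short Python sketch below does this arithmetic using only the two figures stated above (29.3 percent average accuracy, five answer options). The helper name `chance_corrected_accuracy` is ours for illustration and does not come from the paper.

```python
def chance_corrected_accuracy(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 1."""
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)


reported = 0.293                      # average accuracy reported for the 14 models
margin = reported - 1.0 / 5           # absolute gap over the five-way random baseline
corrected = chance_corrected_accuracy(reported, num_choices=5)

print(f"margin over chance: {margin:.3f}")
print(f"chance-corrected accuracy: {corrected:.3f}")
```

Under this rescaling the reported average covers roughly 12 percent of the headroom between guessing and a perfect score.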

Core claim

MI-CXR instantiates clinically grounded visual reasoning over time through five-way multiple-choice questions on five-visit patient timelines and shows that state-of-the-art vision-language models average 29.3 percent accuracy while producing locally plausible interval descriptions that nevertheless violate temporal constraints and lack global consistency.

What carries the argument

The MI-CXR benchmark of five-way multiple-choice questions over five-visit CXR sequences, which probes for enforcement of temporal constraints and composition of evidence into globally consistent decisions.
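As a rough illustration of what one such item might look like in code, the sketch below models a five-way question over a five-visit timeline and scores predictions per task family. The class name, field names, and file layout are assumptions made for this example; the released benchmark may organize its data differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

TASK_FAMILIES = ("TEL", "ICR", "GTS")       # abbreviations used in the paper's figures
OPTION_LABELS = ("A", "B", "C", "D", "E")   # five-way multiple choice


@dataclass(frozen=True)
class TimelineQuestion:
    """Hypothetical container for one benchmark item (field names are assumptions)."""
    task: str                     # one of TASK_FAMILIES
    image_paths: Tuple[str, ...]  # five studies in chronological order, T1..T5
    question: str
    options: Tuple[str, ...]      # five answer texts, aligned with OPTION_LABELS
    answer: str                   # label of the correct option

    def __post_init__(self) -> None:
        if self.task not in TASK_FAMILIES:
            raise ValueError(f"unknown task family: {self.task}")
        if len(self.image_paths) != 5:
            raise ValueError("expected a five-visit timeline")
        if len(self.options) != len(OPTION_LABELS):
            raise ValueError("expected five answer options")
        if self.answer not in OPTION_LABELS:
            raise ValueError(f"answer must be one of {OPTION_LABELS}")


def accuracy_by_task(items: Sequence[TimelineQuestion],
                     predictions: Sequence[str]) -> Dict[str, float]:
    """Fraction of correct option labels, reported separately per task family."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.task] += 1
        correct[item.task] += int(predicted == item.answer)
    return {task: correct[task] / total[task] for task in total}


# Placeholder items; the paths and texts are invented for the example.
timeline = tuple(f"study_T{i}.jpg" for i in range(1, 6))
choices = ("option 1", "option 2", "option 3", "option 4", "option 5")
items = [
    TimelineQuestion("TEL", timeline, "placeholder question", choices, "B"),
    TimelineQuestion("ICR", timeline, "placeholder question", choices, "D"),
]
scores = accuracy_by_task(items, ["B", "A"])
print(scores)
```

Validating the timeline length and option count at construction time keeps malformed items from silently shifting the 20 percent chance level.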

If this is right

Models require explicit mechanisms to enforce temporal order when processing sequences of medical images.
Global consistency across an entire patient timeline must be checked separately from local interval accuracy.
Current vision-language models remain limited for reliable longitudinal monitoring of disease progression.
Benchmarks focused on multi-interval reasoning can expose gaps that single-image or short-pair tests miss.
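One way to picture a separate global-consistency check is a small rule-based validator that walks a model's per-interval descriptions and flags sequences that cannot all be true at once. The sketch below is an illustration under stated assumptions: the change-label vocabulary (`new`, `resolved`, `stable`, `worsened`, `improved`) and the function name are ours, not the paper's annotation scheme.

```python
from typing import List

# Assumed label vocabulary for a single abnormality, one label per interval.
CHANGE_LABELS = {"new", "resolved", "stable", "worsened", "improved"}


def temporal_violations(present_at_start: bool, changes: List[str]) -> List[str]:
    """Return the intervals whose label contradicts the state implied by earlier ones."""
    violations: List[str] = []
    present = present_at_start
    for index, change in enumerate(changes, start=1):
        interval = f"T{index}->T{index + 1}"
        if change not in CHANGE_LABELS:
            raise ValueError(f"unknown change label: {change}")
        if change == "new":
            if present:
                violations.append(f"{interval}: 'new' but the finding is already present")
            present = True
        elif change == "resolved":
            if not present:
                violations.append(f"{interval}: 'resolved' but the finding is absent")
            present = False
        elif change in ("worsened", "improved") and not present:
            violations.append(f"{interval}: '{change}' but the finding is absent")
        # "stable" (and a valid "worsened"/"improved") leaves the presence state unchanged.
    return violations


# Each label is plausible on its own; only the second sequence is inconsistent as a whole.
consistent = temporal_violations(False, ["new", "worsened", "improved", "resolved"])
inconsistent = temporal_violations(False, ["stable", "resolved", "new", "new"])
print(consistent)
print(inconsistent)
```

A sequence can pass every local check (each label is a valid change type) and still fail here, which is the distinction between local interval accuracy and global consistency drawn above.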

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training procedures that explicitly reward temporal ordering and cross-interval consistency could close part of the observed gap.
The same evaluation approach could be applied to other sequential imaging modalities such as CT or MRI follow-ups.
In practice, low global consistency may cause models to miss slow disease trends that span multiple visits.

Load-bearing premise

That five-way multiple-choice questions over five visits, without free-form report generation or extra clinical context, accurately capture clinically grounded visual reasoning over time.

What would settle it

A future model that scores near perfect accuracy on the MI-CXR questions yet still produces temporally inconsistent or contradictory statements when asked to generate free-form longitudinal reports on real patient sequences.

Figures

Figures reproduced from arXiv: 2605.15574 by Jaeyoung Do, Sunghwan Steve Cho, Yunseok Han.

**Figure 1.** Overview of longitudinal medical visual question answering and MI-CXR. Clinical image interpretation requires integrating evidence across multiple patient visits (top). We formalize longitudinal medical VQA into three core reasoning capabilities: Temporal Event Localization (TEL), Interval-wise Change Reasoning (ICR), and Global Trajectory Summarization (GTS), and evaluate them over multi-visit CXR sequence…

**Figure 2.** Overview of MI-CXR construction. We repurpose structured metadata from MIMIC-Ext-CXR-QBA and chest X-ray images from MIMIC-CXR-JPG to construct patient-level longitudinal timelines with at least five visits. After fixing the longitudinal cohort, multiple question types are instantiated from the same timelines to evaluate complementary longitudinal reasoning capabilities, including temporal event localizati…

**Figure 3.** Performance under capability-aligned task decomposition. Models are evaluated using a stagewise inference protocol that separates interval-level evidence articulation from final decision making. 5 Error Patterns in Longitudinal Grounding To better understand the limitations of current VLMs on longitudinal medical reasoning, we analyze error patterns across task categories. A central finding is that most…

**Figure 4.** Example of a patient-level temporally ordered study sequence constructed from scene graph annotations. Each study represents a single clinical visit and aggregates all associated observations and temporal change labels. provided in the metadata, which encodes the relative chronological position of each study for a given patient. This ordering is further validated using timestamp-related fields, includin…

**Figure 5.** Correct interval summary prompt. You generate incorrect but medically plausible interval-based summaries for a single abnormality. Rules: - Use ONLY semantic change-type flips. - Keep interval positions unchanged. - Keep laterality unchanged if present. - Do NOT match the correct summary. - One sentence, at most one semicolon

**Figure 6.** Incorrect interval summary prompt. You are a medical language assistant. You generate correct interval-based temporal summaries for multiple radiologic abnormalities independently. CRITICAL REQUIREMENTS: - You MUST generate exactly ONE summary for EVERY abnormality provided. - DO NOT omit any abnormality under any circumstance. - Even if an abnormality shows no change, remains stable, or is normal, you MUS…

**Figure 7.** Correct global summary prompt (MultiEntity). You generate incorrect but medically plausible interval-based temporal summaries using semantic change-type flips only. CRITICAL REQUIREMENTS: - You MUST generate exactly ONE incorrect summary for EVERY abnormality requested. - Use ONLY semantic flips (e.g., increase/decrease, resolve/persistent). - Do NOT change the temporal order of intervals. - Keep laterali…

**Figure 8.** Incorrect global summary prompt. deterministically from expert-annotated presence transitions: both correct and incorrect options are fully specified by rules without free-form generation. Therefore, LLM-based verification would add little value for TEL and may introduce unnecessary noise (Zhang et al., 2026). Objective and Non-circularity The goal of verification is not to assess clinical correctness or…

**Figure 9.** Aggregated distribution of inter-study intervals across all five-study windows. While the median gap is on the order of one to two days, a substantial fraction of intervals spans several months or longer, indicating heterogeneous temporal horizons within the dataset. Interval-level Gaps Across all consecutive study pairs, the median inter-study gap ranges from 1.4 to 1.9 days, indicating that many follow…

**Figure 10.** Example of Temporal Event Localization (TEL) question with the Single Emergence (Q1), Single Resolution (Q2), Multi Emergence (Q3), and Multi Resolution (Q4)

**Figure 11.** Example of Temporal Event Localization (TEL) question with E→R (Q1) and R→E (Q2)

**Figure 12.** Example of Interval-wise Change Reasoning (ICR) question

**Figure 13.** Example of Global Trajectory Summarization (GTS) question for a single abnormality

**Figure 14.** Example of Global Trajectory Summarization (GTS) question for multiple abnormalities

**Figure 15.** Example of Interval-wise Change Reasoning (ICR) variant. Each question presents a fixed five-visit timeline and asks the model to determine which statement correctly describes the visual change occurring within a given interval (e.g., T4 → T5). All questions in this variant focus on a single abnormality and assess only change-type interpretation, such as new appearance, resolution, or progression. In…

**Figure 16.** Prompt for ICR variant generation. All final answer correctness labels are assigned deterministically prior to LLM invocation. E Stage-wise Evaluation Protocol and Implementation Details This section describes the evaluation protocol and implementation details used to assess model performance across all tasks. The goal of this section is to clarify how models are evaluated in a consistent and reproducibl…

**Figure 18.** Failures in Temporal Event Localization - Multiple (E/R). Problem (TEL – Multiple Emergence Candidates): Which pair of studies correctly captures the interval during which pleural effusion first appears subsequently resolves? Answer Choices: A. T2, T3 B. T1, T3 C. T3, T4 D. T1, T5 E. There is no sequential emergence to resolution Stage-1 Model Response (Interval-level Description): • T1–T2: A small pleura…

**Figure 19.** Failures in Temporal Event Localization - (E→R / R→E). F.2 Failure Mode in ICR Problem (ICR – Multiple Abnormalities): Which statement correctly describes the interval-level change? Answer Choices: A. Between T1 and T2, the pneumothorax has increased. B. Between T2 and T3, the pleural effusion has decreased. C. Between T3 and T4, lung opacity has worsened. D. Between T4 and T5, bibasilar atelectasis has …

**Figure 21.** Failures in Global Trajectory Summarization – Single Abnormality. Problem (GTS – Multi Abnormality): Which statement correctly describes the interval-based trajectory of abnormalities across the study sequence? Target Abnormalities: Pleural effusion, pneumothorax, lung opacity, bony structures intact Answer Choices: A. A pneumothorax newly appears between T2 and T3 and resolves by T5, while pleural effus…

**Figure 23.** Failure within local-interval misinterpretation: ICR example. Problem (GTS – Multi Abnormality): Which statement correctly describes the interval-based trajectory of abnormalities across the study sequence? Answer Choices:

**Figure 24.** Failure within local-interval misinterpretation: GTS example. G Error Type Distribution Across Tasks This section presents a quantitative analysis of how task-aligned reasoning failures are distributed across task families and model families. The analysis complements the qualitative examples in Appendix F by demonstrating that the observed error types arise systematically as a function of task struct…

**Figure 25.** Distribution of task-aligned reasoning failure types across TEL, ICR, and GTS. Local interval misinterpretation is excluded to highlight higher-level temporal reasoning failures. G.1 Distribution by Task Family

**Figure 26.** Distribution of task-aligned reasoning failures across closed-source, open-source, and medical-specialized VLMs

**Figure 28.** H.5 Quantitative Results

| Model | ROUGE-L | METEOR | CIDEr |
| --- | --- | --- | --- |
| GPT-5.2 | 0.307 | 0.354 | 0.358 |
| InternVL3.5-38B | 0.341 | 0.369 | 0.411 |
| MedGemma-27B | 0.326 | 0.340 | 0.393 |

Original abstract

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a multi-visit CXR benchmark that shows current VLMs score low on temporal tasks, but the MCQ setup may allow single-image shortcuts that undercut the global reasoning claims.

The main takeaway is that this work builds a benchmark for reasoning over five-visit chest X-ray sequences and finds that 14 state-of-the-art VLMs average only 29.3% accuracy on it. The three task families cover temporal event localization, interval change reasoning, and global trajectory summarization, which moves past the usual single-image or short-pair medical VQA setups. That extension to longer timelines is the clearest new piece, and it lines up with real needs in tracking chronic conditions where disease evolves across visits.

The stage-wise probing also gives a concrete picture of models producing plausible local descriptions yet failing to hold temporal order or build consistent decisions across the full series. That failure mode analysis is useful for anyone trying to improve temporal handling in medical VLMs. The paper keeps the setup clean by avoiding free-form reports or extra clinical notes, which makes the evaluation focused and reproducible in principle.

On the downside, the multiple-choice format over full timelines raises the exact issue the stress-test flags: without ablations that remove all but one interval or radiologist checks confirming no single-image shortcuts, the low scores may not isolate a deficit in longitudinal composition. The abstract gives performance numbers and qualitative notes but leaves question construction, distractor design, and inter-annotator agreement unaddressed in the summary, so those details matter for trusting the results.

This is aimed at groups working on medical vision-language models and longitudinal reasoning benchmarks. Readers testing new temporal architectures or wanting failure cases to guide improvements could get practical value from the dataset and the reported gaps. It deserves peer review to let referees verify the question validation steps and see whether the temporal claims survive closer inspection of the data construction.

Referee Report

1 major / 2 minor

Summary. The paper introduces MI-CXR, a benchmark for standardized evaluation of multi-interval longitudinal reasoning over five-visit chest X-ray timelines. It defines three task families (Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization) instantiated as five-way multiple-choice questions without free-form generation or extra clinical context. Evaluation of 14 VLMs reports 29.3% average accuracy (modestly above random), with stage-wise probing showing models produce locally plausible descriptions but fail to enforce temporal constraints or achieve global consistency.

Significance. If the questions are validated to require full-timeline integration, the work is significant for filling a gap in medical VQA benchmarks focused on single images or pairs. It provides a reproducible public resource (GitHub link) that exposes concrete limitations in current VLMs for clinically grounded temporal reasoning over disease evolution, supporting targeted progress in longitudinal medical AI.

major comments (1)

[Benchmark Construction / Question Design] The manuscript provides no quantitative validation (e.g., radiologist ratings of question difficulty, ablation removing all but one interval, or shortcut-detection experiments) that the five-way MCQs over five-visit timelines cannot be solved from static visual features in a single image or adjacent pair. This is load-bearing for the central claim in the abstract and §4 that the 29.3% accuracy and probing results demonstrate specific failure at 'enforcing temporal constraints or compose evidence into globally consistent decisions'; without it, the performance gap may reflect local shortcuts rather than a longitudinal reasoning deficit.
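The kind of shortcut-detection ablation the referee asks for can be sketched as follows: score the same questions while showing the model only a subset of the five visits, and compare against the full-timeline score. This is a minimal sketch under stated assumptions, not the authors' protocol. The `Predictor` signature, the dictionary keys, the subset names, and the toy predictor are all hypothetical stand-ins for a real VLM call and the real data format.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: (image paths, question, options) -> predicted option label.
Predictor = Callable[[Sequence[str], str, Sequence[str]], str]

# Which visit indices (0-based, T1..T5) are shown to the model in each setting.
VISIT_SUBSETS: Dict[str, List[int]] = {
    "full_timeline": [0, 1, 2, 3, 4],
    "last_pair": [3, 4],
    "last_image": [4],
}


def accuracy(predict: Predictor, items: Sequence[dict], visits: List[int]) -> float:
    """Accuracy when only the selected visits are passed to the predictor."""
    correct = 0
    for item in items:
        images = [item["images"][i] for i in visits]
        predicted = predict(images, item["question"], item["options"])
        correct += int(predicted == item["answer"])
    return correct / len(items)


def shortcut_report(predict: Predictor, items: Sequence[dict]) -> Dict[str, float]:
    """Accuracy under each visit subset; similar scores hint at a shortcut."""
    return {name: accuracy(predict, items, visits)
            for name, visits in VISIT_SUBSETS.items()}


# Toy data and a stand-in predictor that ignores the images entirely.
toy_items = [
    {
        "images": [f"patient{k}_T{i}.jpg" for i in range(1, 6)],
        "question": "placeholder question",
        "options": ["A", "B", "C", "D", "E"],
        "answer": "A" if k % 2 == 0 else "C",
    }
    for k in range(4)
]


def always_a(images: Sequence[str], question: str, options: Sequence[str]) -> str:
    return "A"


report = shortcut_report(always_a, toy_items)
print(report)  # {'full_timeline': 0.5, 'last_pair': 0.5, 'last_image': 0.5}
```

For a real model, accuracy under `last_image` or `last_pair` staying close to `full_timeline` would suggest that questions can be answered without the full sequence, while a clear drop would support the longitudinal reading. Which subsets are meaningful depends on the question: an item about the T1 to T2 interval cannot sensibly be posed with only T5 visible, so a careful version would choose subsets per question.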

minor comments (2)

[Abstract] The abstract and results would benefit from an explicit statement of the random baseline (20% for five-way MCQ) alongside the 29.3% figure to better contextualize 'modestly above random guessing'.
[Data Curation] Provide more detail on inter-annotator agreement and quality control during question creation to strengthen the claim of clinical groundedness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the major comment below and commit to revisions that incorporate additional validation experiments to strengthen the claims regarding longitudinal reasoning requirements.

Referee: [Benchmark Construction / Question Design] The manuscript provides no quantitative validation (e.g., radiologist ratings of question difficulty, ablation removing all but one interval, or shortcut-detection experiments) that the five-way MCQs over five-visit timelines cannot be solved from static visual features in a single image or adjacent pair. This is load-bearing for the central claim in the abstract and §4 that the 29.3% accuracy and probing results demonstrate specific failure at 'enforcing temporal constraints or compose evidence into globally consistent decisions'; without it, the performance gap may reflect local shortcuts rather than a longitudinal reasoning deficit.

Authors: We acknowledge that the current manuscript does not include explicit quantitative validation such as radiologist ratings, single-interval ablations, or dedicated shortcut-detection experiments to empirically confirm that the questions cannot be solved without full-timeline integration. The task families were designed to require multi-interval reasoning (e.g., Temporal Event Localization specifies identifying the exact visit of an event among five, and Global Trajectory Summarization requires synthesizing the overall disease progression), and the stage-wise probing already indicates models generate locally plausible outputs yet fail at global consistency. Nevertheless, to directly address this concern and support the central claim, we will add shortcut-detection experiments (e.g., performance on single-image or adjacent-pair inputs) and radiologist validation of question difficulty in the revised manuscript. These results will be reported in a new subsection of §3 or §4. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

The paper introduces a new dataset and benchmark (MI-CXR) consisting of five-way MCQs over five-visit timelines across three task families, then reports direct empirical results from evaluating 14 VLMs (average 29.3% accuracy). No equations, fitted parameters, or derivations are present that reduce predictions to inputs by construction. The central claims rest on newly created questions and data rather than self-citation chains, ansatz smuggling, or renaming of known results. The evaluation is self-contained against external model performance and does not presuppose its own conclusions via definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the design choice that multiple-choice questions over fixed five-visit timelines can serve as a proxy for clinical longitudinal reasoning without needing free-form text or extra patient metadata.

axioms (1)

domain assumption: Multiple-choice questions over five-visit patient timelines can assess clinically grounded visual reasoning over time without free-form report generation or additional clinical context.
Explicitly stated in the abstract as the benchmark construction principle.

pith-pipeline@v0.9.0 · 5741 in / 1194 out tokens · 44406 ms · 2026-05-20T19:49:22.331430+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Evaluating 14 state-of-the-art vision-language models shows low overall performance, with an average accuracy of 29.3%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 6 internal anchors

[1]

Scientific data , volume=

A dataset of clinically generated visual questions and answers about radiology images , author=. Scientific data , volume=. 2018 , publisher=

work page 2018
[2]

circulation , volume=

PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals , author=. circulation , volume=. 2000 , publisher=

work page 2000
[3]

PathVQA: 30000+ Questions for Medical Visual Question Answering

PathVQA: 30000+ Questions for Medical Visual Question Answering , author=. arXiv preprint arXiv:2003.10286 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2003
[4]

2025 , eprint=

MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression , author=. 2025 , eprint=

work page 2025
[5]

2025 , eprint=

TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models , author=. 2025 , eprint=

work page 2025
[6]

2025 , eprint=

Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation , author=. 2025 , eprint=

work page 2025
[7]

2025 , eprint=

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays , author=. 2025 , eprint=

work page 2025
[8]

Journal of Biomedical Informatics , volume=

Review of Temporal Reasoning in the Clinical Domain for Timeline Extraction: Where we are and where we need to be , author=. Journal of Biomedical Informatics , volume=. 2021 , issn=

work page 2021
[9]

Nature Communications , volume=

Predicting treatment response from longitudinal images using multi-task deep learning , author=. Nature Communications , volume=. 2021 , doi=

work page 2021
[10]

Radiology , volume=

The Need for Medical Artificial Intelligence That Incorporates Prior Images , author=. Radiology , volume=. 2022 , doi=

work page 2022
[11]

Clinical Oncology , volume=

Longitudinal Image Data for Outcome Modeling , author=. Clinical Oncology , volume=. 2025 , doi=

work page 2025
[12]

2025 , eprint=

PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation , author=. 2025 , eprint=

work page 2025
[13]

2024 , eprint=

HERGen: Elevating Radiology Report Generation with Longitudinal Data , author=. 2024 , eprint=

work page 2024
[14]

In: Proc

Liu, Kang and Ma, Zhuoqi and Kang, Xiaolu and Li, Yunan and Xie, Kun and Jiao, Zhicheng and Miao, Qiguang , year=. Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation , url=. doi:10.1109/cvpr52734.2025.00968 , booktitle=

work page doi:10.1109/cvpr52734.2025.00968 2025
[15]

2025 , eprint=

Insights into a radiology-specialised multimodal large language model with sparse autoencoders , author=. 2025 , eprint=

work page 2025
[16]

2025 , howpublished=

MIMIC-Ext-CXR-QBA: A Structured, Tagged, and Localized Visual Question Answering Dataset with Question-Box-Answer Triplets and Scene Graphs for Chest X-ray Images , author=. 2025 , howpublished=

work page 2025
[17]

2024 , howpublished=

MIMIC-CXR-JPG: Chest Radiographs with Structured Labels , author=. 2024 , howpublished=

work page 2024
[18]

MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

MIMIC-CXR: A Large Publicly Available Database of Labeled Chest Radiographs , author=. arXiv preprint arXiv:1901.07042 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1901
[19]

2025 , eprint=

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning , author=. 2025 , eprint=

work page 2025
[20]

Zhang, Kai and Zhou, Rong and Adhikarla, Eashan and Yan, Zhiling and Liu, Yixin and Yu, Jun and Liu, Zhengliang and Chen, Xun and Davison, Brian D. and Ren, Hui and Huang, Jing and Chen, Chen and Zhou, Yuyin and Fu, Sunyang and Liu, Wei and Liu, Tianming and Li, Xiang and Chen, Yong and He, Lifang and Zou, James and Li, Quanzheng and Liu, Hongfang and Sun...

work page doi:10.1038/s41591-024-03185-2
[21]

2025 , eprint=

MedGemma Technical Report , author=. 2025 , eprint=

work page 2025
[22]

2025 , eprint=

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning , author=. 2025 , eprint=

work page 2025
[23]

2023 , eprint=

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day , author=. 2023 , eprint=

work page 2023
[24]

and Lausen, Mads and Bruun, Niels Henrik and Nielsen, Michael Bachmann and Zacho, Helle D

Lange, Mine Benedicte and Petersen, Lars J. and Lausen, Mads and Bruun, Niels Henrik and Nielsen, Michael Bachmann and Zacho, Helle D. , TITLE =. Diagnostics , VOLUME =. 2022 , NUMBER =

work page 2022
[25]

White and K

K. White and K. Berbaum and W. L. Smith , title =. Investigative Radiology , year =. doi:10.1097/00004424-199403000-00002 , pmid =

work page doi:10.1097/00004424-199403000-00002
[26]

Insights into Imaging , year =

Li Zhang and Xin Wen and Jian-Wei Li and Xu Jiang and Xian-Feng Yang and Meng Li , title =. Insights into Imaging , year =. doi:10.1186/s13244-023-01521-7 , url =

work page doi:10.1186/s13244-023-01521-7
[27]

2021 , eprint=

SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering , author=. 2021 , eprint=

work page 2021
[28]

2024 , eprint=

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering , author=. 2024 , eprint=

work page 2024
[29]

2024 , eprint=

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale , author=. 2024 , eprint=

work page 2024
[30]

2024 , version =

Bae, Seongmin and Kyung, Donghyun and Ryu, Jaeho and Cho, Eunji and Lee, Gyeongmin and Kweon, Seungwoo and Oh, Jaehun and Ji, Limin and Chang, Eunsol and Kim, Taeyoung and Choi, Edward , title =. 2024 , version =. doi:10.13026/deqx-d943 , url =

work page doi:10.13026/deqx-d943 2024
[31]

2023 , eprint=

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images , author=. 2023 , eprint=

work page 2023
[32]

2024 , eprint=

CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats , author=. 2024 , eprint=

work page 2024
[33]

2025 , eprint=

Libra: Leveraging Temporal Images for Biomedical Radiology Analysis , author=. 2025 , eprint=

work page 2025
[34]

2024 , eprint=

Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays , author=. 2024 , eprint=

work page 2024
[35]

and Zhu, Yingying , title =

Hu, Xinyue and Gu, Lin and An, Qiyuan and Zhang, Mengliang and Liu, Liangchen and Kobayashi, Kazuma and Harada, Tatsuya and Summers, Ronald M. and Zhu, Yingying , title =. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages =. 2023 , isbn =. doi:10.1145/3580305.3599819 , abstract =

work page doi:10.1145/3580305.3599819 2023
[36]

2021 , eprint=

Describing and Localizing Multiple Changes with Transformers , author=. 2021 , eprint=

work page 2021
[37]

2024 , eprint=

ReXrank: A Public Leaderboard for AI-Powered Radiology Report Generation , author=. 2024 , eprint=

work page 2024
[38]

2025 , eprint=

Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions , author=. 2025 , eprint=

work page 2025
[39]

2024 , eprint=

Large Language Models Are Not Robust Multiple Choice Selectors , author=. 2024 , eprint=

work page 2024
[40]

2024 , url =

Anthropic , title =. 2024 , url =

work page 2024
[41]

2025 , url =

OpenAI , title =. 2025 , url =

work page 2025
[42]

2024 , url =

Google DeepMind , title =. 2024 , url =

work page 2024
[43]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025
[44]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. arXiv preprint arXiv:2308.12966 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265.
[48]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding. 2024.
[49]

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. 2023.
[50]

What matters when building vision-language models? 2024.
[51]

Evaluating Step-by-step Reasoning Traces: A Survey. 2025.
[52]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 2023.
[53]

MRGAgents: A Multi-Agent Framework for Improved Medical Report Generation with Med-LVLMs. 2025.
[54]

CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation. 2025.
[55]

CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. 2020.
[56]

Guo, Li and Tahir, Anas M. and Zhang, Dong and Wang, Z. Jane and Ward, Rabab K. Automatic Medical Report Generation: Methods and Applications. APSIPA Transactions on Signal and Information Processing. doi:10.1561/116.20240044
[57]

Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records. 2025.
[58]

Mingyu Zhang and Chenglong Xu and Yihong Gan and Yu Wang and Yi Fu and Yongqiang Chen. Automating construction contract question answering using large language model and fine-tuning. 2026. doi:10.1016/j.eswa.2025.129493
[59]

BleedOrigin: Dynamic Bleeding Source Localization in Endoscopic Submucosal Dissection via Dual-Stage Detection and Tracking. 2025.
[60]

Mann, Ritse M. The Lancet. doi:10.1016/S0140-6736(25)00093-5
[61]

Hoang, Jenny K. Journal of the American College of Radiology. doi:10.1016/j.jacr.2015.10.017
[62]

Holste, G. and Lin, M. and Zhou, R. and others. npj Digital Medicine. doi:10.1038/s41746-024-01207-4
[63]

J.E. Longitudinal Image Data for Outcome Modeling. 2025. doi:10.1016/j.clon.2024.06.053
[64]

Kojima, Takeshi and Gu, Shixiang (Shane) and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke. Large Language Models are Zero-Shot Reasoners.
[65]

Vedantam, Ramakrishna and Lawrence Zitnick, C. and Parikh, Devi. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[66]

Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004.
[67]

Banerjee, Satanjeev and Lavie, Alon. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005.
