arxiv: 2604.10410 · v2 · submitted 2026-04-12 · 💻 cs.AI

Recognition: unknown

CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation

Shantam Srivastava , Mahesh Bhosale , David Doermann , Mingchen Gao

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.AI

keywords radiology report generationcontrastive decodingmulti-modal large language modelschest X-raystructured medical reportspathology co-occurrencemedical image interpretation

0 comments

The pith

Category-wise contrastive decoding generates more accurate structured radiology reports from chest X-rays by reducing reliance on language priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that single-pass decoding in current multi-modal models for radiology reports causes the model to pay less attention to the X-ray image as text generation continues, leading to reports that incorrectly combine unrelated pathologies. CWCD counters this by splitting generation into pathology categories and using a contrast between the full image and a masked version of it, guided by category-specific prompts. A sympathetic reader would care because this keeps the output tied more closely to visible evidence rather than statistical patterns in the training text. If the approach holds, it offers a modular way to improve existing foundation models without full retraining.

Core claim

Current foundation models generate radiology reports in a single forward pass, which diminishes attention to visual tokens and increases reliance on language priors, introducing spurious pathology co-occurrences. CWCD introduces category-specific parameterization and generates category-wise reports by contrasting normal X-rays with masked X-rays using category-specific visual prompts, yielding consistent gains on both clinical efficacy and natural language generation metrics.

What carries the argument

Category-Wise Contrastive Decoding (CWCD), which applies category-specific visual prompts to contrast a normal chest X-ray against its masked counterpart for each pathology category during report generation.

If this is right

Generated reports contain fewer incorrect combinations of pathologies.
Clinical efficacy metrics improve because factual alignment with image findings rises.
Natural language generation metrics improve due to better coherence within each category.
The method integrates modularly with existing models such as LLaVA-Rad without requiring new end-to-end training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contrastive masking idea could extend to other medical imaging modalities where language priors dominate over visual detail.
Category-wise structure may improve clinician trust by making it easier to verify each section against the image.
If masking is learned rather than fixed, the approach might adapt to new categories without manual prompt engineering.

Load-bearing premise

Single-pass decoding in multi-modal models necessarily reduces attention to the image and that contrasting against masked X-rays will correct this without creating new errors or biases.

What would settle it

On a test set of X-rays with known independent pathologies, CWCD-generated reports would show no reduction in the rate of spurious co-occurrence pairs compared with single-pass baselines.

Figures

Figures reproduced from arXiv: 2604.10410 by David Doermann, Mahesh Bhosale, Mingchen Gao, Shantam Srivastava.

**Figure 1.** Figure 1: Category-Wise Contrastive Decoding (CWCD) generates a category-wise structured report under eight anatomical headers by contrasting a normal X-ray with a masked X-ray (3 categories shown here for brevity). Automated Radiology Report Generation (RRG), the task of producing free-text descriptions of visual observations from a radiology image, such as a chest X-ray, has therefore emerged as an essential res… view at source ↗

**Figure 2.** Figure 2: LAMA score calculated from 100 randomly sampled images from MIMIC-CXR dataset using LLaVA-Rad over text tokens (left) and image tokens (right). During the report generation process, we observe a pronounced decline in attention to image tokens accompanied by a steady increase in reliance on linguistic priors. Motivation. We observe that, as report generation progresses, the model’s attention increasingly re… view at source ↗

**Figure 3.** Figure 3: An overview of CWCD framework for the “Cardiovascular” Anatomical category. The base log probability distribution is contrasted with the masked log probability distribution using Eq.6. We then sample the highest probability token from the final distribution. This process repeats for each token in an auto-regressive form to obtain a Category report. Reports across all categories are aggregated to obtain a f… view at source ↗

**Figure 4.** Figure 4: We replicated the experiment presented in Sec. 1 on CheXagent-2 to demonstrate that the problem of attention decay over image tokens during token generation also affects other MLLMs. Appendix B. Related Work Structured Findings Generation. Findings section of a radiology report is comprised of visual observations from a given chest X-ray. Usually, these are free-text reports but there is a growing body of … view at source ↗

read the original abstract

Interpreting chest X-rays is inherently challenging due to the overlap between anatomical structures and the subtle presentation of many clinically significant pathologies, making accurate diagnosis time-consuming even for experienced radiologists. Recent radiology-focused foundation models, such as LLaVA-Rad and Maira-2, have positioned multi-modal large language models (MLLMs) at the forefront of automated radiology report generation (RRG). However, despite these advances, current foundation models generate reports in a single forward pass. This decoding strategy diminishes attention to visual tokens and increases reliance on language priors as generation proceeds, which in turn introduces spurious pathology co-occurrences in the generated reports. To mitigate these limitations, we propose Category-Wise Contrastive Decoding (CWCD), a novel and modular framework designed to enhance structured radiology report generation (SRRG). Our approach introduces category-specific parameterization and generates category-wise reports by contrasting normal X-rays with masked X-rays using category-specific visual prompts. Experimental results demonstrate that CWCD consistently outperforms baseline methods across both clinical efficacy and natural language generation metrics. An ablation study further elucidates the contribution of each architectural component to overall performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CWCD adds category-wise masked contrast to radiology MLLM decoding, but the mechanism may not cleanly separate visual signals amid chest X-ray overlaps.

read the letter

The main takeaway is that this paper introduces CWCD, a modular add-on that generates reports one category at a time by contrasting a normal X-ray against a masked version using category-specific visual prompts. It targets the drop in visual attention and rise in language priors that happens in single-pass decoding from models like LLaVA-Rad and Maira-2, aiming to reduce spurious pathology co-occurrences in structured reports. The framework is presented as new in its combination of category parameterization and this contrastive step, with an ablation to break down component contributions. That is the concrete addition on offer. The approach does a reasonable job of naming a plausible failure mode in current medical vision-language models and offering a testable fix that keeps the base model intact. If the experiments hold up, the modularity could make it easy for others to try on similar tasks. The ablation is a positive step for showing what drives any gains. The soft spot is the masking step itself. Chest X-rays contain heavy anatomical overlap, so region-based or heuristic masking for one pathology category can easily leave residual cues from others or remove relevant context. The contrast might then pick up on artifacts or extra parameterization rather than restoring true visual attention. The abstract supplies no numbers, baselines, or statistical details, so effect sizes and consistency are impossible to judge from the summary alone. If the full paper shows large, controlled improvements with clear isolation of the mechanism, that would strengthen the case; otherwise the gains risk being explained by prompting overhead. This work is for researchers focused on medical report generation and multimodal decoding tweaks. A reader already working on radiology MLLMs or contrastive methods would find the framework worth examining for ideas, even if they end up modifying the masking. It deserves a serious referee because the problem is real, the method is implementable, and the claims are falsifiable with the right controls. I would send it out for review with instructions to the referees to examine the masking validity and run additional overlap checks.

Referee Report

2 major / 1 minor

Summary. The paper proposes Category-Wise Contrastive Decoding (CWCD), a modular framework for structured radiology report generation (SRRG) from chest X-rays using MLLMs. It identifies single-pass decoding as causing diminished visual attention and spurious pathology co-occurrences, then introduces category-specific parameterization to generate reports by contrasting normal X-rays against masked versions via category-specific visual prompts. The central claim is consistent outperformance over baselines on clinical efficacy and NLG metrics, supported by an ablation study.

Significance. If the empirical results and mechanism hold, CWCD provides a practical, modular addition to existing radiology MLLMs that could reduce reliance on language priors and improve report accuracy in a domain where anatomical overlap and subtle pathologies make automation error-prone. The emphasis on category-wise handling and contrastive decoding aligns with needs for more reliable structured outputs in clinical AI.

major comments (2)

[Method] Method section (description of CWCD): the claim that contrasting normal X-rays with masked X-rays using category-specific prompts reliably corrects diminished visual attention rests on the unverified assumption that masking isolates relevant visual evidence without residual cues or cross-category leakage; given heavy anatomical overlap in chest X-rays, the paper must supply masking details, visualizations, or quantitative checks showing the contrast isolates pathologies rather than introducing artifacts.
[Experiments] Experiments section: the abstract asserts 'consistent outperformance' on clinical and NLG metrics, yet no numerical results, baseline specifications, statistical tests, or effect sizes are referenced in the provided text; without these, the central claim cannot be evaluated and the ablation's contribution remains unquantified.

minor comments (1)

[Abstract] Abstract: the phrasing 'Experimental results demonstrate...' would be clearer if it previewed at least one key metric or baseline name rather than remaining purely qualitative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the manuscript will be updated to strengthen the presentation and address the concerns raised.

read point-by-point responses

Referee: [Method] Method section (description of CWCD): the claim that contrasting normal X-rays with masked X-rays using category-specific prompts reliably corrects diminished visual attention rests on the unverified assumption that masking isolates relevant visual evidence without residual cues or cross-category leakage; given heavy anatomical overlap in chest X-rays, the paper must supply masking details, visualizations, or quantitative checks showing the contrast isolates pathologies rather than introducing artifacts.

Authors: We agree that the current description of the masking procedure is insufficient to fully substantiate the isolation of visual evidence, particularly given the anatomical overlap in chest X-rays. In the revised manuscript, we will expand the Method section with a detailed specification of the category-specific masking algorithm (including mask generation parameters and application to visual tokens), add example visualizations comparing original, masked, and contrasted images, and include quantitative checks such as pathology co-occurrence rates and attention map comparisons before/after masking to demonstrate reduced leakage and artifact introduction. revision: yes
Referee: [Experiments] Experiments section: the abstract asserts 'consistent outperformance' on clinical and NLG metrics, yet no numerical results, baseline specifications, statistical tests, or effect sizes are referenced in the provided text; without these, the central claim cannot be evaluated and the ablation's contribution remains unquantified.

Authors: The full experiments section provides numerical results in tables comparing CWCD to baselines (LLaVA-Rad, Maira-2) on clinical efficacy (CheXbert F1) and NLG metrics (BLEU, ROUGE), along with ablation scores and statistical tests. However, to improve immediate evaluability as noted, we will revise the abstract to reference key quantitative improvements and effect sizes, and ensure the experiments section explicitly highlights baseline specifications, p-values from significance tests, and the ablation's per-component contributions with specific metric deltas. revision: partial

Circularity Check

0 steps flagged

No circularity: method presented as independent architectural addition without self-referential derivations

full rationale

The paper introduces CWCD as a modular framework using category-specific parameterization, contrastive decoding between normal and masked X-rays, and category-wise visual prompts. No equations, derivations, or first-principles results are shown that reduce any claimed improvement to a fitted parameter, self-definition, or input by construction. The central claims rest on experimental outperformance and ablation studies rather than any load-bearing self-citation chain or renamed known result. The derivation chain is self-contained as an empirical architectural proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the effectiveness of a newly proposed decoding strategy whose benefit is asserted via experiments; the ledger therefore records the new method itself as an invented entity and the stated limitation of single-pass decoding as a domain assumption.

axioms (1)

domain assumption Single forward pass decoding in MLLMs diminishes attention to visual tokens and increases reliance on language priors as generation proceeds
This premise is stated directly in the abstract as the motivation for the new method.

invented entities (1)

Category-Wise Contrastive Decoding (CWCD) no independent evidence
purpose: To enhance structured radiology report generation through category-specific parameterization and contrastive generation from masked X-rays
The framework is introduced as a novel modular component in this work.

pith-pipeline@v0.9.0 · 5502 in / 1320 out tokens · 38234 ms · 2026-05-10T16:33:09.150774+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E

URLhttps://arxiv.org/abs/2401.12208. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URLhttps://lmsys.org/blog/2023-03-30-vicuna/. Eric W. Christensen...

work page doi:10.1016/j.jacr.2024.10 2023
[2]

Xu Chu, Xinrong Chen, Guanyu Wang, Zhijie Tan, Kui Huang, Wenyu Lv, Tong Mo, and Weiping Li

Publisher Copyright:©2024 American College of Radiology. Xu Chu, Xinrong Chen, Guanyu Wang, Zhijie Tan, Kui Huang, Wenyu Lv, Tong Mo, and Weiping Li. Qwen look again: Guiding vision-language reasoning models to re-attention visual information, 2025. URLhttps://arxiv.org/abs/2505.23558. DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms v...

work page doi:10.18653/v1/2022.findings-emnlp.319 2024
[3]

URLhttps://arxiv.org/abs/2408.15802. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale.Proceedings of the 9th International...

work page doi:10.13026/0pw2-je90 2021
[4]

B leu: a method for automatic evaluation of machine translation

Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040/. Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Benedikt Wiestler, Nassir Navab, and Matthias Keicher. Radialog: Large vision-language models for x-ray reporting and dialog- driven assistance. InMedical Imaging with Deep Learning, 2025. Alec Rad...

work page doi:10.3115/1073083.1073135 2025
[5]

Learning Transferable Visual Models From Natural Language Supervision

URLhttps://arxiv.org/abs/2103.00020. Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression, 2019. URLhttps://arxiv.org/abs/1902.09630. Omar Sabri, Bassam Al-Shargabi, and Abdelrahman Abuarqoub. The role of artificial intelligence in...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.32604/cmc.2025.066987 2019
[6]

ISBN 978-3-030-32226-7

Springer International Publishing. ISBN 978-3-030-32226-7. Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungre...

work page doi:10.1038/s41467-025-58344-x 2025