Discrete Diffusion Language Models for Interactive Radiology Report Drafting

Halil Ibrahim Gulluk; Max Van Puyvelde; Olivier Gevaert; Wim Van Criekinge

arxiv: 2607.01436 · v1 · pith:XBFD5AO6new · submitted 2026-07-01 · 💻 cs.AI · cs.LG

Pith reviewed 2026-07-03 20:18 UTC · model grok-4.3

keywords diffusion language models · radiology report generation · visual question answering · any-order infill · interactive text generation · medical imaging · LoRA fine-tuning · autoregressive comparison

The pith

A finetuned diffusion language model matches or exceeds autoregressive models on medical visual question answering while decoding 3.5-4.4 times faster and enabling any-order text infill.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts a mixture-of-experts diffusion language model originally trained on general text and fine-tunes it with the same low-rank adaptation recipe used for its autoregressive counterpart on medical visual question answering datasets. It demonstrates that the resulting model performs at least as well as the autoregressive version when outputs are scored by a verbosity-robust LLM judge, while also running substantially faster at generation time. The central advantage highlighted is that the diffusion process denoises the entire token canvas bidirectionally, which makes it possible for a user to supply or edit arbitrary fragments of a radiology report and have the model complete the remaining text in any order. This interactive capability directly addresses the variable structure and completeness found in actual clinical reports across institutions and clinicians. The work positions diffusion models as a practical alternative for medical report drafting rather than a purely theoretical improvement over left-to-right generation.

Core claim

By adapting DiffusionGemma-26B via LoRA on medical visual question answering data, the finetuned diffusion model with 3.8B active parameters reaches or surpasses the performance of its autoregressive sibling Gemma-4-26B on the same benchmarks, delivers 3.5-4.4x faster decoding, and supplies native any-order infill because the token canvas is refined bidirectionally rather than emitted sequentially.

What carries the argument

Bidirectional iterative denoising of a full token canvas in a mixture-of-experts diffusion language model, which refines all positions simultaneously instead of generating tokens left to right.
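The mechanism can be illustrated with a toy sketch. This is not the paper's model: a lookup over two hypothetical report sentences (`TEMPLATES`) stands in for the neural denoiser, and the schedule of unmasking one position per step, most confident first, is an assumption made for illustration. The `bidirectional=False` branch restricts conditioning to the tokens left of each gap, as a crude stand-in for left-to-right generation.

```python
from typing import List, Optional

MASK = None  # marks a canvas position that still has to be filled
Canvas = List[Optional[str]]

# Toy stand-in for the denoiser (an assumption, not the paper's model):
# two hypothetical report sentences are the only things it can generate.
TEMPLATES = [
    "the heart size is normal . no pleural effusion .".split(),
    "the heart size is enlarged . small pleural effusion .".split(),
]


def best_template(canvas: Canvas, templates, allowed):
    """Return (template, score) for the template that agrees with the most
    already-visible tokens among the positions in `allowed`."""
    scores = [
        sum(1 for i in allowed if canvas[i] is not MASK and canvas[i] == t[i])
        for t in templates
    ]
    top = max(scores)
    return templates[scores.index(top)], top  # ties go to the first template


def infill(canvas: Canvas, templates=TEMPLATES, bidirectional: bool = True) -> Canvas:
    """Fill masked positions one per step, most confident proposal first.
    Fixed (non-masked) tokens are never overwritten."""
    canvas = list(canvas)
    while any(tok is MASK for tok in canvas):
        proposals = []
        for i, tok in enumerate(canvas):
            if tok is not MASK:
                continue
            # Bidirectional: condition on every visible token.
            # Prefix-only: condition on tokens to the left of the gap.
            allowed = range(len(canvas)) if bidirectional else range(i)
            template, score = best_template(canvas, templates, allowed)
            proposals.append((score, -i, template[i]))
        _, neg_i, token = max(proposals)  # highest score, then leftmost
        canvas[-neg_i] = token
    return canvas


draft = "the heart size is".split() + [MASK, MASK] + "small pleural effusion .".split()
both_sides = infill(draft, bidirectional=True)
prefix_only = infill(draft, bidirectional=False)
print(" ".join(both_sides))
print(" ".join(prefix_only))
```

With the fixed fragment "small pleural effusion ." visible on the right, the bidirectional pass fills the gap with "enlarged ."; the prefix-only pass cannot see that fragment and falls back to "normal .", which contradicts the text that follows. A real autoregressive model can be prompted in other ways, so this contrast is only a schematic of the conditioning difference.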

If this is right

The finetuned diffusion model becomes competitive with frontier vision-language models on medical visual question answering tasks.
Generation runs 3.5-4.4 times faster than the same-size autoregressive model under identical fine-tuning.
Any-order infill lets a radiologist fix report fragments and have the model complete the text between them.
The approach fits real radiology reports that are often terse or vary in structure across clinicians and institutions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interactive any-order editing could reduce the time radiologists spend revising draft reports in clinical software.
The same bidirectional mechanism may transfer to other medical document tasks such as discharge summaries or pathology reports that require non-sequential revisions.
Combining diffusion generation with stronger vision encoders could further close remaining gaps with larger frontier models on image-conditioned medical text.

Load-bearing premise

The verbosity-robust LLM judge used to score outputs accurately captures clinical correctness and usefulness of the generated reports rather than surface-level fluency.

What would settle it

A blinded side-by-side evaluation by practicing radiologists that measures clinical accuracy and usefulness of diffusion-generated versus autoregressive-generated reports on a held-out set of real patient imaging cases.

Figures

Figures reproduced from arXiv: 2607.01436 by Halil Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert, Wim Van Criekinge.

**Figure 1.** A medical-VQA example (VQA-Med). Every model’s answer to “what is abnormal in the CT scan?” (reference: pancreatic ductal adenocarcinoma), with the LLM judge’s verdict (✓ correct, × incorrect). Base and frontier models reply in full sentences that exact-match scoring would reject regardless of content; here only the finetuned diffusion model answers correctly.

**Figure 2.** LLM-judge accuracy (Claude Sonnet 4.6). (a) Base vs. finetuned, for diffusion and AR. (b) The finetuned 26B model (3.8B active) against three frontier non-reasoning VLMs. † Claude Sonnet 4.6 is the judge model.

**Figure 3.** Completing a gap from both sides. One sentence of a chest X-ray report is masked (the gap) and filled from the surrounding fixed fragments. Top: the diffusion model draws on fragments on either side and recovers the sentence correctly. Bottom: the autoregressive sibling sees only the fragments before it, the rest greyed out, and reconstructs it incorrectly. Real MIMIC-CXR example.
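The Figure 1 caption's point about exact-match scoring can be made concrete with a short sketch. It is not the paper's LLM judge: `exact_match` and `contains_reference` are hypothetical helpers, and the example answers other than the reference string are invented for illustration. The sketch shows that strict matching rejects a correct full-sentence answer, while a naive lenient check accepts a negated one, which is one reason a judge that reads the whole answer might be preferred.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """Strict scoring: the whole answer must equal the reference."""
    return normalize(prediction) == normalize(reference)


def contains_reference(prediction: str, reference: str) -> bool:
    """Lenient scoring: the reference appears as a whole-word phrase."""
    return f" {normalize(reference)} " in f" {normalize(prediction)} "


reference = "pancreatic ductal adenocarcinoma"
# Hypothetical model answers, invented for this illustration.
terse = "Pancreatic ductal adenocarcinoma"
verbose = "The CT scan shows a mass consistent with pancreatic ductal adenocarcinoma."
negated = "There is no evidence of pancreatic ductal adenocarcinoma."

for answer in (terse, verbose, negated):
    print(exact_match(answer, reference), contains_reference(answer, reference))
```

The verbose answer fails exact match despite naming the right finding, and the negated answer passes the containment check despite being wrong. Neither heuristic tracks clinical correctness.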

Original abstract

Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive. We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets, scored by a verbosity-robust LLM judge. Diffusion matches or exceeds AR on all of them, and the finetuned model (3.8B active) is competitive with frontier vision-language models; its decoding is also 3.5-4.4x faster. Beyond this parity, the diffusion model offers a drafting capability AR lacks: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix report fragments and have the model fill the text between them, an operation inherent to diffusion but not to autoregression, which is subpar at it. This suits real reports, which are often terse or inconsistent across clinicians and institutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper adapts a diffusion LM to medical VQA and shows it matches its AR sibling on LLM-judge scores while adding any-order infilling for interactive drafting, but the clinical grounding of those scores is thin.

The main point is that a finetuned diffusion model reaches parity with its autoregressive sibling on medical visual question answering tasks, runs 3.5-4.4x faster at decode, and supports any-order infilling that lets a user fix fragments and have the model complete the rest. That last part is the practical difference for radiology reports, which are often edited piecemeal rather than written left to right.

The work applies an existing mixture-of-experts diffusion model under the same LoRA setup as the AR baseline and shows the bidirectional canvas can handle terse or inconsistent clinical text. The speed numbers and the 3.8B active parameter count are stated clearly, and the interactive use case is a direct consequence of the generation method rather than an add-on.

The soft spot is the evaluation. Performance parity rests entirely on a verbosity-robust LLM judge with no reported human radiologist ratings, no inter-rater checks, and no correlation to clinical metrics such as RadGraph or CheXpert label accuracy. In a domain where factual errors in findings carry real cost, an unanchored proxy is the weakest link. Dataset names, statistical tests, and ablation details are also absent from the abstract, though the full paper may supply them.

This is for groups working on medical text generation or non-autoregressive language models who want to test interactive drafting. A reader already following diffusion LMs will get the most from the infilling demonstration. The paper deserves a serious referee because the core technical claim is testable and the interactive angle is worth checking, even if the current evidence for clinical reliability needs strengthening.

I would send it to peer review so domain experts can examine whether the judge tracks actual report quality.

Referee Report

2 major / 1 minor

Summary. The manuscript adapts the DiffusionGemma-26B mixture-of-experts discrete diffusion language model and compares it to its autoregressive sibling Gemma-4-26B under an identical LoRA fine-tuning recipe on medical visual question answering datasets. Performance is measured exclusively by a verbosity-robust LLM judge; the diffusion model is reported to match or exceed the AR baseline, decode 3.5-4.4x faster, and enable any-order infill for interactive radiology report drafting.

Significance. If the performance parity and speed claims hold under clinical validation, the work would show that discrete diffusion models can reach parity with autoregressive models in a specialized medical domain while adding bidirectional infill capabilities that match real clinical drafting workflows. The any-order infill feature is a genuine architectural advantage not available to standard AR models.

major comments (2)

[Abstract] The central claim that the finetuned diffusion model 'matches or exceeds AR on all of them' rests solely on scores from an unvalidated verbosity-robust LLM judge. No human radiologist ratings, inter-rater agreement statistics, or correlation with established clinical metrics (RadGraph, CheXpert label accuracy, or error-severity scales) are reported to establish that the judge tracks factual correctness rather than surface fluency.
[Abstract] Dataset names, sizes, task definitions, and statistical test details for the VQA benchmarks are omitted, preventing assessment of whether the reported parity is robust or limited to particular distributions.
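One way the requested agreement statistic could be computed is Cohen's kappa between the LLM judge's correct/incorrect verdicts and a radiologist's verdicts on the same answers. The sketch below is an illustration only: the verdict lists are hypothetical placeholders, not data from the paper, and `cohens_kappa` is a helper written for this example.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (True = answer judged correct); not from the paper.
llm_judge = [True, True, False, True, False, False, True, True]
radiologist = [True, False, False, True, False, True, True, True]

kappa = cohens_kappa(llm_judge, radiologist)
print(f"kappa = {kappa:.3f}")
```

Raw agreement on these placeholder verdicts is 6 of 8, but kappa is lower because both raters mark most answers correct, so some agreement is expected by chance. Reporting kappa alongside raw agreement would make that distinction visible.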

minor comments (1)

[Abstract] The abstract states '3.8B active parameters' for the finetuned model but does not clarify how this active-parameter count was obtained from the 26B MoE backbone or whether it affects the speed comparison.
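For orientation, active-parameter counts in mixture-of-experts models are commonly approximated as below. This is a generic relation under assumed top-k routing over equally sized experts, not a breakdown taken from the paper; all symbols are introduced here for illustration.

```latex
N_{\text{active}} \;\approx\; N_{\text{shared}} + \frac{k}{E}\, N_{\text{experts}},
\qquad
N_{\text{total}} \;=\; N_{\text{shared}} + N_{\text{experts}}
```

Here $N_{\text{shared}}$ counts parameters used for every token (attention, embeddings, routers, any always-on experts), $N_{\text{experts}}$ counts all routed expert parameters, $E$ is the number of routed experts per layer, and $k$ is the number selected per token. The abstract does not give these values, so the relation only indicates which quantities the authors would need to report.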

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation methodology. We address each major comment below.

Referee: [Abstract] The central claim that the finetuned diffusion model 'matches or exceeds AR on all of them' rests solely on scores from an unvalidated verbosity-robust LLM judge. No human radiologist ratings, inter-rater agreement statistics, or correlation with established clinical metrics (RadGraph, CheXpert label accuracy, or error-severity scales) are reported to establish that the judge tracks factual correctness rather than surface fluency.

Authors: We agree that the reported performance parity relies solely on the verbosity-robust LLM judge without human radiologist validation, inter-rater statistics, or explicit correlation to clinical metrics such as RadGraph or CheXpert. This constitutes a genuine limitation of the current study. We will revise the abstract to explicitly qualify the claims as LLM-judge-based and to moderate the language around 'matches or exceeds.' A limitations discussion on automated evaluation in medical domains will also be added to the manuscript.

revision: partial

Referee: [Abstract] Dataset names, sizes, task definitions, and statistical test details for the VQA benchmarks are omitted, preventing assessment of whether the reported parity is robust or limited to particular distributions.

Authors: We will update the abstract to include the specific medical VQA dataset names, their sizes, and task definitions. Statistical test details are already provided in the experimental setup and results sections of the full manuscript; we will ensure the abstract references these details more clearly.

revision: yes

standing simulated objections not resolved

Human radiologist ratings, inter-rater agreement, and direct correlation of the LLM judge with clinical metrics such as RadGraph or CheXpert remain outstanding. These were not collected in the original experiments and would require new data collection.

Circularity Check

0 steps flagged

No circularity; empirical benchmarking only

The paper reports direct experimental results from fine-tuning a diffusion LM (DiffusionGemma-26B) and its AR sibling (Gemma-4-26B) under an identical LoRA recipe, then measuring VQA performance via an LLM judge and noting the any-order infill property inherent to bidirectional denoising. No equations, parameter fits presented as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks rather than on outputs that reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or implied beyond standard model fine-tuning assumptions.

pith-pipeline@v0.9.1-grok · 5740 in / 1084 out tokens · 20722 ms · 2026-07-03T20:18:33.652158+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 23 canonical work pages · 10 internal anchors

[1] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems (NeurIPS), 2021. doi: 10.48550/arXiv.2107.03006
[2] Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, et al. MAIRA-2: Grounded radiology report generation, 2024. arXiv preprint arXiv:2406.04449
[3] Asma Ben Abacha, Sadid A. Hasan, Vivek V. Datla, Joey Liu, Dina Demner-Fushman, and Henning Müller. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes, CEUR Workshop Proceedings, 2019
[4] Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, and Yi Xu. ECHO: Efficient chest x-ray report generation with one-step block diffusion, 2026. arXiv preprint arXiv:2604.09450
[5] Gemma Team, Google DeepMind. Gemma 4: Open multimodal models. Model card, https://huggingface.co/google/gemma-4-26B-A4B-it, 2026
[6] Google DeepMind. DiffusionGemma: Block discrete-diffusion language models. Model card, https://huggingface.co/google/diffusiongemma-26B-A4B-it, 2026
[7] Halil Ibrahim Gulluk and Olivier Gevaert. SemEnrich: Self-supervised semantic enrichment of radiology reports for vision-language learning. arXiv preprint arXiv:2604.09887, 2026. doi: 10.48550/arXiv.2604.09887
[8] Halil Ibrahim Gulluk, Max Van Puyvelde, and Olivier Gevaert. OpenMedQ: Broad open pretraining for medical vision-language models. arXiv preprint arXiv:2606.12953, 2026. doi: 10.48550/arXiv.2606.12953
[9] Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, and Olivier Gevaert. SDR: Set-distance rewards for radiology report generation, 2026. arXiv preprint arXiv:2606.00440
[10] HKU NLP Group. DreamOn: Diffusion language models for code infilling beyond fixed-size canvas, 2026. arXiv preprint arXiv:2602.01326
[11] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. doi: 10.48550/arXiv.2106.09685
[12] Stephanie L. Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, et al. MAIRA-1: A specialised large multimodal model for radiology report generation, 2023. arXiv preprint arXiv:2311.13668
[13, 14] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019. doi: 10.1038/s41597-019-0322-0
[15] Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 5(1):180251, 2018. doi: 10.1038/sdata.2018.251
[16] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks, 2023. doi: 10.48550/arXiv.2306.00890
[17] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In IEEE International Symposium on Biomedical Imaging (ISBI), 2021. doi: 10.1109/ISBI48211.2021.9434010
[18] Jiawei Mao, Yuhan Wang, Lifeng Chen, Can Zhao, Yucheng Tang, Dong Yang, Liangqiong Qu, Daguang Xu, and Yuyin Zhou. Discrete diffusion models with MLLMs for unified medical multimodal generation, 2025. arXiv preprint arXiv:2510.06131
[19] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025. doi: 10.48550/arXiv.2502.09992
[20, 21] Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024. doi: 10.48550/arXiv.2406.07524
[22] Tim Tanida, Philip Müller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiology report generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. doi: 10.1109/CVPR52729.2023.00718
[23] Sheng Wang et al. CopilotCAD: Empowering radiologists with report completion models and quantitative evidence from medical image foundation models, 2024. arXiv preprint arXiv:2404.07424
[24] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025. doi: 10.48550/arXiv.2508.15487
[25] Shiying Yu, Jielei Wang, and Guoming Lu. AnchorDiff: Topology-aware masked diffusion with confidence-based rewriting for radiology report generation, 2026. arXiv preprint arXiv:2605.17071
[26] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In International Conference on Computer Vision (ICCV), 2023. doi: 10.48550/arXiv.2303.15343
[27] Xiaoman Zhang, Hong-Yu Zhou, Xiaoli Yang, et al. ReXrank: A public leaderboard for AI-powered radiology report generation, 2024. arXiv preprint arXiv:2411.15122
[28] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023. doi: 10.48550/arXiv.2...
