Recognition: no theorem link
Gaze2Report: Radiology Report Generation via Visual-Gaze Prompt Tuning of LLMs
Pith reviewed 2026-05-10 19:12 UTC · model grok-4.3
The pith
Eye-gaze prediction lets large language models generate more accurate radiology reports without needing real gaze data at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gaze2Report leverages a scanpath prediction module and Graph Neural Network to generate joint visual-gaze tokens. Combined with instruction and report tokens, these form a multimodal prompt used to fine-tune LoRA layers of large language models for autoregressive report generation. The framework enhances report quality through eye-gaze-guided visual learning and incorporates on-the-fly scanpath prediction, enabling the model to operate without gaze input during inference.
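To make the token flow concrete, here is a minimal PyTorch sketch of the prompt assembly the claim describes. All module names, shapes, and the dense similarity graph are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the multimodal prompt assembly, with stand-in tensors for the
# ViT features, predicted scanpath embeddings, and text embeddings. Shapes and the
# GNN design are assumptions for illustration only.
import torch
import torch.nn as nn

D = 768                      # shared token dimension (assumed)
N_PATCH, N_FIX = 196, 16     # ViT patches; predicted fixations (assumed)

class GazeFusionGNN(nn.Module):
    """One round of message passing between visual and gaze nodes (simplified)."""
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, D); adj: (B, N, N) row-normalized adjacency
        m = adj @ self.msg(nodes)                       # aggregate neighbor messages
        return self.upd(torch.cat([nodes, m], dim=-1))  # update node states

B = 2
visual = torch.randn(B, N_PATCH, D)   # stand-in for ViT patch features
gaze = torch.randn(B, N_FIX, D)       # stand-in for predicted scanpath embeddings

nodes = torch.cat([visual, gaze], dim=1)                 # joint visual-gaze graph nodes
adj = torch.softmax(nodes @ nodes.transpose(1, 2), -1)   # dense similarity graph (assumed)
joint_tokens = GazeFusionGNN(D)(nodes, adj)              # joint visual-gaze tokens

instruction = torch.randn(B, 24, D)   # embedded instruction tokens (stand-in)
report = torch.randn(B, 128, D)       # embedded report tokens (training targets)

# Multimodal prompt fed to the LoRA-adapted LLM decoder for autoregressive generation.
prompt = torch.cat([joint_tokens, instruction, report], dim=1)
print(prompt.shape)  # torch.Size([2, 364, 768])
```

The structural point is that the LLM never sees raw gaze data, only the fused visual-gaze tokens; once the scanpath predictor is trained, inference needs no eye tracker.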
What carries the argument
A scanpath prediction module and a graph neural network that together produce joint visual-gaze tokens for multimodal LLM prompting via LoRA fine-tuning.
If this is right
- Generated reports align more closely with the actual locations and manifestations of disease visible in the scans.
- The system can be used in everyday clinical settings that do not have eye-tracking equipment available at inference.
- Multimodal prompts built from predicted gaze tokens increase the relevance of visual features passed to the language model.
- LoRA-based tuning keeps the approach computationally efficient while still incorporating gaze-guided information; a minimal sketch of why appears after this list.
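The efficiency point in the last bullet comes from LoRA's low-rank update: only two small matrices train while the base weight stays frozen. A minimal sketch, with an assumed rank and layer size rather than the paper's configuration:

```python
# Minimal LoRA adapter sketch in PyTorch; rank, alpha, and the 4096-wide layer
# are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update (W + B A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))     # an assumed LLM projection size
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
# -> trainable 65,536 of 16,846,848 (0.39%)
```

Only A and B receive gradients, so the trainable fraction stays well under one percent of the layer even before counting the frozen LLM backbone.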
Where Pith is reading between the lines
- The same scanpath prediction idea could be tested on other medical imaging tasks where human visual search patterns are known to matter, such as pathology slides or ultrasound.
- If the prediction module can be made robust across different hospitals and scanner types, the need to collect expensive gaze data for every new training set would decrease.
- Combining predicted attention with language models might also help flag cases where the model is attending to the wrong image regions before the report is finalized.
Load-bearing premise
The module that predicts radiologist eye movements from images actually captures the attention patterns that matter for diagnosis, and fusing those predictions through the graph network improves report accuracy without adding new mistakes.
What would settle it
Train and evaluate the full Gaze2Report pipeline against a baseline LLM that uses only image features and standard prompts on the same test set; if the gaze-augmented version shows no gain in clinical metrics such as disease mention accuracy or report completeness, the central claim does not hold.
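A toy sketch of that head-to-head protocol, assuming both systems' reports have already been converted to disease-mention labels by a CheXbert-style labeler. The disease list, label vectors, and scores below are invented purely to show the shape of the comparison:

```python
# Toy ablation sketch: compare disease-mention micro-F1 for an image-only baseline
# against the gaze-augmented model on the same test set. All data here is invented.
from sklearn.metrics import f1_score

DISEASES = ["atelectasis", "cardiomegaly", "edema", "pneumonia"]  # assumed subset

# 1 = disease mentioned as present, 0 = absent; one row per test report (toy data)
reference  = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
baseline   = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0]]  # image-only prompts
gaze_model = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]  # gaze-augmented prompts

flat_ref = [y for row in reference for y in row]
for name, preds in [("baseline", baseline), ("gaze-augmented", gaze_model)]:
    flat_pred = [y for row in preds for y in row]
    # micro-F1 over all disease mentions pooled across the test set
    print(f"{name}: micro-F1 = {f1_score(flat_ref, flat_pred):.2f}")
# -> baseline: micro-F1 = 0.67
# -> gaze-augmented: micro-F1 = 0.92   (toy numbers, not reported results)
```

If the gaze-augmented column showed no gain under this protocol on real labels, the central claim would fail as stated.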
read the original abstract
Existing deep learning methods for radiology report generation enhance diagnostic efficiency but often overlook physician-informed medical priors. This leads to a suboptimal alignment between the structured explanations and disease manifestations. Eye gaze data provides critical insights into a radiologist's visual attention, enhancing the relevance and interpretability of extracted features while aligning with human decision-making processes. However, despite its promising potential, the integration of eye gaze information into AI-driven medical imaging workflows is impeded by challenges such as the complexity of multimodal data fusion and the high cost of gaze acquisition, particularly its absence during inference, limiting its practical applicability in real-world clinical settings. To address these issues, we introduce Gaze2Report, a framework which leverages a scanpath prediction module and Graph Neural Network (GNN) to generate joint visual-gaze tokens. Combined with instruction and report tokens, these form a multimodal prompt used to fine-tune LoRA layers of large language models (LLMs) for autoregressive report generation. Gaze2Report enhances report quality through eye-gaze-guided visual learning and incorporates on-the-fly scanpath prediction, enabling the model to operate without gaze input during inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Gaze2Report, a framework for radiology report generation that integrates eye-gaze data via a scanpath prediction module and GNN to create joint visual-gaze tokens. These are combined with instruction and report tokens to prompt-tune LoRA-adapted LLMs for generating reports. It emphasizes the ability to predict scanpaths on-the-fly, thus not requiring gaze input at inference time, to improve alignment between reports and disease manifestations based on radiologist attention patterns.
Significance. Should the empirical validation confirm the benefits, this approach could significantly advance the field by incorporating human visual priors into LLM-based medical imaging systems, potentially leading to more accurate and interpretable radiology reports while solving the practical issue of gaze data unavailability during deployment.
major comments (2)
- [Abstract] The abstract claims improvements in report quality and alignment with disease manifestations via gaze-guided learning, but no quantitative results, ablation studies, or metrics (such as BLEU scores or CheXbert accuracy) are mentioned to support these claims, which is critical for evaluating the central contribution.
- [Method] The description of the GNN fusion of visual and predicted gaze tokens does not include any analysis or experiments validating that the predicted tokens accurately capture attention patterns or that the fusion improves performance without introducing errors, undermining the load-bearing assumption of the framework.
minor comments (1)
- [Abstract] The acronym 'GNN' is introduced without expansion on first use.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the potential significance of Gaze2Report. We have carefully reviewed the major comments and provide point-by-point responses below, outlining revisions that will be incorporated into the next version of the manuscript.
read point-by-point responses
- Referee: [Abstract] The abstract claims improvements in report quality and alignment with disease manifestations via gaze-guided learning, but no quantitative results, ablation studies, or metrics (such as BLEU scores or CheXbert accuracy) are mentioned to support these claims, which is critical for evaluating the central contribution.
  Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript includes comprehensive evaluations using BLEU, METEOR, ROUGE, and CheXbert clinical accuracy metrics, along with ablation studies isolating the contribution of gaze-guided tokens. We will revise the abstract to include key results (e.g., specific BLEU-4 and CheXbert gains) and reference the ablation findings to directly substantiate the claims. revision: yes
- Referee: [Method] The description of the GNN fusion of visual and predicted gaze tokens does not include any analysis or experiments validating that the predicted tokens accurately capture attention patterns or that the fusion improves performance without introducing errors, undermining the load-bearing assumption of the framework.
  Authors: We acknowledge this point and the importance of validating the scanpath prediction and GNN fusion. The manuscript provides ablation studies in the experiments section comparing performance with and without gaze integration, plus qualitative scanpath visualizations. To strengthen the presentation, we will add quantitative scanpath prediction metrics (e.g., similarity to ground-truth gaze) and an error analysis of the fusion step in the revised manuscript. revision: yes
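For the scanpath-similarity metric the rebuttal proposes, one common choice is a ScanMatch-style edit distance over grid-quantized fixation sequences. The sketch below is an illustrative stand-in, not necessarily the metric the authors use:

```python
# Illustrative scanpath similarity: quantize fixations to grid cells, then score
# 1 - normalized edit distance. A stand-in for "similarity to ground-truth gaze".
def quantize(scanpath, grid=4, width=224, height=224):
    """Map (x, y) fixations to grid-cell IDs, giving a sequence of symbols."""
    return [int(y * grid / height) * grid + int(x * grid / width)
            for x, y in scanpath]

def levenshtein(a, b):
    """Classic edit distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        cur = [i]
        for j, sb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (sa != sb)))
        prev = cur
    return prev[-1]

def scanpath_similarity(pred, true, grid=4):
    qa, qb = quantize(pred, grid), quantize(true, grid)
    return 1.0 - levenshtein(qa, qb) / max(len(qa), len(qb), 1)  # 1.0 = identical

pred = [(30, 40), (100, 100), (180, 60)]   # toy predicted fixations (pixels)
true = [(28, 44), (104, 96), (60, 180)]    # toy ground-truth fixations
print(f"similarity: {scanpath_similarity(pred, true):.2f}")  # -> 0.67
```

Reporting such a score alongside the report-quality ablation would directly test the load-bearing premise that predicted scanpaths track real radiologist attention.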
Circularity Check
No circularity in derivation chain
full rationale
The paper introduces an architectural framework (scanpath prediction module + GNN fusion into multimodal prompts for LoRA fine-tuning of LLMs) trained on external gaze data during development but operating without gaze at inference. No equations, derivations, or first-principles claims are present that reduce any performance improvement to a fitted parameter defined by the target result itself, a self-referential definition, or a load-bearing self-citation chain. The central claims remain empirically testable via standard metrics and ablations rather than tautological by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Eye gaze data provides critical insights into a radiologist's visual attention that enhance feature relevance and align with human decision-making.
invented entities (1)
- joint visual-gaze tokens (no independent evidence)
Reference graph
Works this paper leans on
- [1] INTRODUCTION: The rapid growth of medical imaging data has significantly increased the diagnostic workload for radiologists, necessitating fast yet precise reporting to ensure timely clinical decision-making. Automated report generation [1, 2] is an active area of research involving deep learning systems that generate text descriptions from radiological ...
- [2] METHODOLOGY: Overview. Our proposed Gaze2Report framework consists of two main components: a visual-gaze token generation module and a LLM decoder for text generation. The visual-gaze module involves four steps: 1) extracting visual features from images using a Vision Transformer (ViT) [11], 2) time-aggregating gaze fixations from MedGaze [12] to generate g...
- [3] EXPERIMENT DESIGN AND RESULTS: Dataset description. This study utilizes three datasets: REFLACX [21], IU-XRAY [22], and MIMIC-CXR [23]. REFLACX includes CXRs with synchronized eye-tracking and transcription pairs annotated by radiologists (1800 train, 707 test samples). IU-XRAY contains 3955 fully de-identified radiology reports associated with 7470 CXR...
- [4] CONCLUSION: Current methods often struggle to generate radiology reports with strong alignment between impressions and disease manifestations due to the absence of physician-informed medical priors such as eye gaze. To address this, we enhance feature relevance through rich visual-gaze interaction via GNN and improve interpretability by tuning an LLM wit...
- [5] COMPLIANCE WITH ETHICAL STANDARDS: This study, conducted on open-source data, did not require ethical approval. Acknowledgements: This research was partially supported by NIH grants 1R01CA297843-01, 1R03DE033489-01A1, and NSF grant 2442053. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
- [6] Zhanyu Wang et al., "Automated radiographic report generation purely on transformer: A multicriteria supervised approach," IEEE TMI, 2022.
- [7] Fenglin Liu et al., "Exploring and distilling posterior and prior knowledge for radiology report generation," in IEEE/CVF CVPR, 2021.
- [8] Marcella Cornia et al., "Meshed-memory transformer for image captioning," in IEEE/CVF CVPR, 2020.
- [9] Wei Li et al., "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning," arXiv preprint arXiv:2012.15409, 2020.
- [10] Rehab Alahmadi and James Hahn, "Improve image captioning by estimating the gazing patterns from the caption," in IEEE/CVF WACV, 2022.
- [11] Jun Wang et al., "Camanet: Class activation map guided attention network for radiology report generation," IEEE JBHI, 2024.
- [12] Moinak Bhattacharya et al., "Gazeradar: A gaze and radiomics-guided disease localization framework," in MICCAI, 2022.
- [13] Moinak Bhattacharya et al., "Radiotransformer: A cascaded global-focal transformer for visual attention-guided disease classification," in ECCV, 2022.
- [14] Moinak Bhattacharya and Prateek Prasanna, "Gazediff: A radiologist visual attention guided diffusion model for zero-shot disease classification," in MIDL, 2024.
- [15] Peixi Peng et al., "Eye gaze guided cross-modal alignment network for radiology report generation," IEEE JBHI, 2024.
- [16] Alexey Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
- [17] Akash Awasthi et al., "Multimodal learning and cognitive processes in radiology: Medgaze for chest x-ray scanpath prediction," arXiv preprint arXiv:2407.00129, 2024.
- [18] Kai Han et al., "Vision GNN: An image is worth graph of nodes," NeurIPS, 2022.
- [19] Oriol Vinyals et al., "Show and tell: A neural image caption generator," in IEEE CVPR, 2015.
- [20] Kelvin Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.
- [21] Jiasen Lu et al., "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in IEEE CVPR, 2017.
- [22] Zhihong Chen et al., "Generating radiology reports via memory-driven transformer," in EMNLP, 2020.
- [23] Jun Wang et al., "Cross-modal prototype driven network for radiology report generation," in ECCV, 2022.
- [24] Zhanyu Wang et al., "Metransformer: Radiology report generation by transformer with multiple learnable expert tokens," in IEEE/CVF CVPR, 2023.
- [25] Zhanyu Wang et al., "R2gengpt: Radiology report generation with frozen LLMs," Meta-Radiology, 2023.
- [26] Ricardo Bigolin Lanfredi et al., "Reflacx, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays," Scientific Data, 2022.
- [27] Dina Demner-Fushman et al., "Preparing a collection of radiology examinations for distribution and retrieval," JAMIA, 2016.
- [28] Alistair E. W. Johnson et al., "Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs," arXiv preprint arXiv:1901.07042, 2019.
- [29] Zhihong Chen et al., "Generating radiology reports via memory-driven transformer," in EMNLP, 2020.
- [30] Kishore Papineni et al., "Bleu: A method for automatic evaluation of machine translation," in ACL, 2002.
- [31] Satanjeev Banerjee and Alon Lavie, "Meteor: An automatic metric for MT evaluation with improved correlation with human judgments," in ACL Workshop, 2005.
- [32] Chin-Yew Lin, "Rouge: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004.
- [33] Akshay Smit et al., "Chexbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using bert," arXiv preprint arXiv:2004.09167, 2020.
- [34] Jean-Benoit Delbrouck et al., "Improving the factual correctness of radiology report generation with semantic rewards," in ACL Findings, 2022.