GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays
Pith reviewed 2026-05-10 15:03 UTC · model grok-4.3
The pith
GazeVaLM releases eye-tracking data from 16 radiologists and outputs from 6 LLMs to compare perception of real versus diffusion-generated chest X-rays.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GazeVaLM is a dataset of gaze recordings, fixation maps, scanpaths, saliency density maps, diagnostic labels, and authenticity judgments from 16 radiologists viewing 60 chest radiographs (30 real, 30 diffusion-generated), extended with matched predictions and scores from six state-of-the-art multimodal LLMs to enable direct human-AI comparison on clinical perception and realism detection.
What carries the argument
The GazeVaLM dataset, which supplies paired eye-tracking recordings, clinical labels, and LLM outputs under diagnostic and Visual Turing test conditions for matched analysis.
If this is right
- The released gaze and label data enable quantitative benchmarking of radiologist versus LLM performance in diagnostic accuracy and authenticity detection.
- Analyses of gaze agreement and inter-observer consistency become possible for both real and synthetic images.
- Direct comparison of human and model uncertainty levels is supported through released confidence scores.
- The dataset facilitates research on how visual attention patterns differ when experts judge image authenticity.
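The third point above — comparing human and model uncertainty from released confidence scores — can be made concrete as a ranking statistic such as ROC AUC for authenticity detection. A minimal sketch, assuming scores where higher means "judged more likely real" (a hypothetical convention; the dataset's actual score semantics may differ):

```python
import numpy as np

def authenticity_auc(confidences, is_real):
    """ROC AUC for real-vs-synthetic detection from confidence scores.

    confidences: scores where higher means 'more likely real'
                 (a hypothetical convention, not taken from the paper).
    is_real:     boolean ground-truth labels, True for real images.
    """
    conf = np.asarray(confidences, dtype=float)
    real = np.asarray(is_real, dtype=bool)
    pos, neg = conf[real], conf[~real]
    # AUC equals the probability that a random real image outranks a
    # random synthetic one, counting ties as half a win.
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```

The same function applies unchanged to a radiologist's per-image authenticity ratings and to an LLM's confidence scores, which is what makes decision-level and uncertainty-level comparison possible on matched items.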
Where Pith is reading between the lines
- Training generative models with loss terms that penalize mismatch to observed radiologist scanpaths could improve perceived clinical realism.
- The same protocol could be applied to other imaging modalities to test whether perception differences are modality-specific.
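The first speculation above — penalizing mismatch to observed radiologist attention — could be operationalized as an auxiliary loss on a generator's attention map. A minimal sketch, assuming a 2-D attention map over the image grid; the function name and the KL formulation are illustrative, not from the paper:

```python
import numpy as np

def saliency_kl_loss(model_attn, human_density, eps=1e-8):
    """KL(human || model) between normalised 2-D attention maps.

    A hypothetical auxiliary loss: penalise a generator's attention map
    for diverging from the observed radiologist fixation density.
    Both inputs are non-negative 2-D arrays over the same image grid.
    """
    p = human_density / (human_density.sum() + eps)
    q = model_attn / (model_attn.sum() + eps)
    # eps keeps the log finite where either map is exactly zero
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

In a training loop this term would be weighted against the usual diffusion objective; whether it actually improves perceived realism is exactly the kind of question the released gaze data could test.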
Load-bearing premise
That observations from only 30 diffusion-generated images, under two specific tasks, are sufficient to reveal general differences in how experts and AI perceive clinical realism in chest X-rays.
What would settle it
A new study that applies the same protocol to images from a different generative model or collects data from substantially more radiologists and finds markedly different gaze agreement or authenticity detection rates would indicate the current benchmark does not generalize.
Original abstract
We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions - enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.
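The per-observer artifacts the abstract lists (fixation maps, saliency density maps) are conventionally derived from raw gaze samples by kernel smoothing. A minimal sketch, assuming gaze samples as (x, y) pixel coordinates; the function name and default bandwidth are illustrative, not the dataset's actual pipeline:

```python
import numpy as np

def fixation_density_map(gaze_xy, shape, sigma=30.0):
    """Accumulate gaze samples into a smoothed density (saliency) map.

    gaze_xy: (n, 2) array of (x, y) pixel coordinates, one row per sample.
    shape:   (height, width) of the image.
    sigma:   Gaussian bandwidth in pixels (hypothetical default).
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    density = np.zeros((h, w))
    # Place an isotropic Gaussian at each gaze sample
    for x, y in gaze_xy:
        density += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    density /= density.sum()  # normalise to a probability map
    return density
```

Maps built this way per observer can then be compared pairwise (e.g., by correlation or KL divergence) to quantify the gaze agreement and inter-observer consistency the abstract mentions.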
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GazeVaLM, a public eye-tracking dataset comprising 960 recordings from 16 expert radiologists on 30 real and 30 diffusion-generated synthetic chest X-rays. Data are collected under two conditions (diagnostic assessment and Visual Turing test for authenticity), with raw gaze samples, fixation maps, scanpaths, saliency maps, diagnostic labels, and authenticity judgments provided per image-observer pair. The protocol is extended to six state-of-the-art multimodal LLMs, releasing their diagnoses, authenticity labels, and confidence scores for direct human-AI comparison. Analyses of gaze agreement, inter-observer consistency, diagnostic accuracy, and authenticity detection are included. The dataset is released to support research in gaze modeling, clinical decision-making, human-AI differences, generative image realism, and uncertainty quantification.
Significance. If the synthetic images prove representative and the collection protocol is fully documented, the joint release of gaze data, clinical labels, and matched LLM predictions could enable reproducible studies of expert visual attention and human-AI perceptual differences in medical imaging. This would be a useful resource for the field, particularly for gaze modeling and realism assessment tasks.
Major comments (2)
- [Abstract] The claim that GazeVaLM enables general study of clinical perception, authenticity assessment, and human-AI differences in AI-generated X-rays rests on only 30 diffusion-generated images from a single pipeline. No quantitative evidence (e.g., FID scores, perceptual metrics, or comparisons to other generative backbones) is provided that these images span the artifact distribution of current medical generative models, so gaze patterns, diagnostic gaps, and authenticity judgments remain tied to this narrow sample rather than supporting the stated general utility of the benchmark.
- [Abstract] The description of dataset size, participant count, and image split is given, but no details appear on image generation parameters, radiologist recruitment criteria, gaze calibration, or statistical analysis methods. These omissions are load-bearing for evaluating whether the 960 recordings constitute a reliable, reproducible benchmark.
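The FID-style evidence the first comment asks for reduces to a Fréchet distance between Gaussian fits of image embeddings. A minimal sketch, assuming pre-extracted feature vectors (in standard FID these come from an Inception network; the extractor is left abstract here, and the function name is illustrative):

```python
import numpy as np

def frechet_distance(feats_real, feats_synth):
    """Frechet distance between Gaussian fits of two feature sets.

    feats_*: (n_samples, dim) arrays of image embeddings from some fixed
    feature extractor (Inception features in standard FID).
    """
    mu1, mu2 = feats_real.mean(0), feats_synth.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_synth, rowvar=False)
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) via the eigenvalues of S1 @ S2, which are real
    # and non-negative for PSD covariances (clip numerical noise).
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0, None)).sum()
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2 * tr_sqrt)
```

With only 30 synthetic images the covariance estimate is noisy, so any reported score for this dataset would itself need a caveat about sample size.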
Minor comments (1)
- [Abstract] The dataset URL is provided, but the abstract could briefly note the number of images per condition and the exact LLM models used, to improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the scope and documentation of GazeVaLM. The comments correctly identify areas where the manuscript can be strengthened for clarity and to avoid overgeneralization. We address each major comment below, indicating the revisions we will incorporate.
Point-by-point responses
- Referee: [Abstract] The claim that GazeVaLM enables general study of clinical perception, authenticity assessment, and human-AI differences in AI-generated X-rays rests on only 30 diffusion-generated images from a single pipeline. No quantitative evidence (e.g., FID scores, perceptual metrics, or comparisons to other generative backbones) is provided that these images span the artifact distribution of current medical generative models, so gaze patterns, diagnostic gaps, and authenticity judgments remain tied to this narrow sample rather than supporting the stated general utility of the benchmark.
Authors: We agree that the synthetic images originate from a single diffusion pipeline and that the manuscript provides no FID scores or cross-model comparisons to demonstrate coverage of the full range of current generative artifacts. The observed gaze patterns and human-AI differences are therefore specific to this generation method rather than broadly representative. In the revision we will (1) add FID and perceptual similarity metrics for the 30 synthetic images, (2) include a dedicated Limitations subsection that explicitly states the benchmark is tied to one generative backbone, and (3) revise the abstract and introduction to frame the contribution as a reproducible resource for studying diffusion-generated chest X-rays rather than claiming general utility across all AI-generated medical images. These changes preserve the value of the released multi-observer gaze data while accurately reflecting its scope. Revision: yes.
- Referee: [Abstract] The description of dataset size, participant count, and image split is given, but no details appear on image generation parameters, radiologist recruitment criteria, gaze calibration, or statistical analysis methods. These omissions are load-bearing for evaluating whether the 960 recordings constitute a reliable, reproducible benchmark.
Authors: The full manuscript contains these details in the Methods section (image generation parameters and prompts in Section 3.1, radiologist recruitment and inclusion criteria in Section 3.2, the eye-tracker calibration protocol in Section 3.3, and statistical analysis procedures in Section 4). However, we acknowledge that the abstract and early sections do not surface them sufficiently for a benchmark paper. We will expand the abstract with concise statements of the key parameters, add a summary table of dataset-construction details, and cross-reference the key methodological parameters from the abstract to the relevant Methods subsections. This will make the reproducibility information immediately accessible without altering the existing content. Revision: yes.
Circularity Check
Dataset and benchmark release with no derivations, fitted parameters, or self-referential predictions.
Full rationale
The manuscript introduces a new eye-tracking dataset (GazeVaLM) comprising 960 recordings from 16 radiologists on 60 chest X-rays (30 real, 30 diffusion-generated) under two protocols, plus extensions to 6 LLMs. It reports raw data, fixation maps, labels, and basic analyses of gaze agreement and accuracy differences. No equations, parameter fitting, uniqueness theorems, or predictions are defined in terms of the authors' own prior choices or fitted inputs. All claims reduce to direct measurement and release of new observations rather than any closed loop of self-definition or renamed fits. This is a standard honest non-finding for a benchmark paper.