GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays
Pith reviewed 2026-05-10 15:03 UTC · model grok-4.3
The pith
GazeVaLM releases eye-tracking data from 16 radiologists and outputs from 6 LLMs to compare perception of real versus diffusion-generated chest X-rays.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GazeVaLM is a dataset of gaze recordings, fixation maps, scanpaths, saliency density maps, diagnostic labels, and authenticity judgments from 16 radiologists viewing 60 chest radiographs (30 real, 30 diffusion-generated), extended with matched predictions and scores from six state-of-the-art multimodal LLMs to enable direct human-AI comparison on clinical perception and realism detection.
What carries the argument
The GazeVaLM dataset, which supplies paired eye-tracking recordings, clinical labels, and LLM outputs under diagnostic and Visual Turing test conditions for matched analysis.
If this is right
- The released gaze and label data enable quantitative benchmarking of radiologist versus LLM performance in diagnostic accuracy and authenticity detection.
- Analyses of gaze agreement and inter-observer consistency become possible for both real and synthetic images.
- Direct comparison of human and model uncertainty levels is supported through released confidence scores.
- The dataset facilitates research on how visual attention patterns differ when experts judge image authenticity.
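The third point above — comparing human and model uncertainty from released confidence scores — can be made concrete as a ranking statistic such as ROC AUC for authenticity detection. A minimal sketch, assuming scores where higher means "judged more likely real" (a hypothetical convention; the dataset's actual score semantics may differ):

```python
import numpy as np

def authenticity_auc(confidences, is_real):
    """ROC AUC for real-vs-synthetic detection from confidence scores.

    confidences: scores where higher means 'more likely real'
                 (a hypothetical convention, not taken from the paper).
    is_real:     boolean ground-truth labels, True for real images.
    """
    conf = np.asarray(confidences, dtype=float)
    real = np.asarray(is_real, dtype=bool)
    pos, neg = conf[real], conf[~real]
    # AUC equals the probability that a random real image outranks a
    # random synthetic one, counting ties as half a win.
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```

The same function applies unchanged to a radiologist's per-image authenticity ratings and to an LLM's confidence scores, which is what makes decision-level and uncertainty-level comparison possible on matched items.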
Where Pith is reading between the lines
- Training generative models with loss terms that penalize mismatch to observed radiologist scanpaths could improve perceived clinical realism.
- The same protocol could be applied to other imaging modalities to test whether perception differences are modality-specific.
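The first speculation above — penalizing mismatch to observed radiologist attention — could be operationalized as an auxiliary loss on a generator's attention map. A minimal sketch, assuming a 2-D attention map over the image grid; the function name and the KL formulation are illustrative, not from the paper:

```python
import numpy as np

def saliency_kl_loss(model_attn, human_density, eps=1e-8):
    """KL(human || model) between normalised 2-D attention maps.

    A hypothetical auxiliary loss: penalise a generator's attention map
    for diverging from the observed radiologist fixation density.
    Both inputs are non-negative 2-D arrays over the same image grid.
    """
    p = human_density / (human_density.sum() + eps)
    q = model_attn / (model_attn.sum() + eps)
    # eps keeps the log finite where either map is exactly zero
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

In a training loop this term would be weighted against the usual diffusion objective; whether it actually improves perceived realism is exactly the kind of question the released gaze data could test.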
Load-bearing premise
That observations from only 30 diffusion-generated images, under two specific tasks, are sufficient to reveal general differences in how experts and AI perceive clinical realism in chest X-rays.
What would settle it
A new study that applies the same protocol to images from a different generative model or collects data from substantially more radiologists and finds markedly different gaze agreement or authenticity detection rates would indicate the current benchmark does not generalize.
Original abstract
We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions - enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.
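The per-observer artifacts the abstract lists (fixation maps, saliency density maps) are conventionally derived from raw gaze samples by kernel smoothing. A minimal sketch, assuming gaze samples as (x, y) pixel coordinates; the function name and default bandwidth are illustrative, not the dataset's actual pipeline:

```python
import numpy as np

def fixation_density_map(gaze_xy, shape, sigma=30.0):
    """Accumulate gaze samples into a smoothed density (saliency) map.

    gaze_xy: (n, 2) array of (x, y) pixel coordinates, one row per sample.
    shape:   (height, width) of the image.
    sigma:   Gaussian bandwidth in pixels (hypothetical default).
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    density = np.zeros((h, w))
    # Place an isotropic Gaussian at each gaze sample
    for x, y in gaze_xy:
        density += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    density /= density.sum()  # normalise to a probability map
    return density
```

Maps built this way per observer can then be compared pairwise (e.g., by correlation or KL divergence) to quantify the gaze agreement and inter-observer consistency the abstract mentions.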
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GazeVaLM, a public eye-tracking dataset comprising 960 recordings from 16 expert radiologists on 30 real and 30 diffusion-generated synthetic chest X-rays. Data are collected under two conditions (diagnostic assessment and Visual Turing test for authenticity), with raw gaze samples, fixation maps, scanpaths, saliency maps, diagnostic labels, and authenticity judgments provided per image-observer pair. The protocol is extended to six state-of-the-art multimodal LLMs, releasing their diagnoses, authenticity labels, and confidence scores for direct human-AI comparison. Analyses of gaze agreement, inter-observer consistency, diagnostic accuracy, and authenticity detection are included. The dataset is released to support research in gaze modeling, clinical decision-making, human-AI differences, generative image realism, and uncertainty quantification.
Significance. If the synthetic images prove representative and the collection protocol is fully documented, the joint release of gaze data, clinical labels, and matched LLM predictions could enable reproducible studies of expert visual attention and human-AI perceptual differences in medical imaging. This would be a useful resource for the field, particularly for gaze modeling and realism assessment tasks.
Major comments (2)
- [Abstract] The claim that GazeVaLM enables general study of clinical perception, authenticity assessment, and human-AI differences in AI-generated X-rays rests on only 30 diffusion-generated images from a single pipeline. No quantitative evidence (e.g., FID scores, perceptual metrics, or comparisons to other generative backbones) is provided that these images span the artifact distribution of current medical generative models, so gaze patterns, diagnostic gaps, and authenticity judgments remain tied to this narrow sample rather than supporting the stated general utility of the benchmark.
- [Abstract] The description of dataset size, participant count, and image split is given, but no details appear on image generation parameters, radiologist recruitment criteria, gaze calibration, or statistical analysis methods. These omissions are load-bearing for evaluating whether the 960 recordings constitute a reliable, reproducible benchmark.
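The FID-style evidence the first comment asks for reduces to a Fréchet distance between Gaussian fits of image embeddings. A minimal sketch, assuming pre-extracted feature vectors (in standard FID these come from an Inception network; the extractor is left abstract here, and the function name is illustrative):

```python
import numpy as np

def frechet_distance(feats_real, feats_synth):
    """Frechet distance between Gaussian fits of two feature sets.

    feats_*: (n_samples, dim) arrays of image embeddings from some fixed
    feature extractor (Inception features in standard FID).
    """
    mu1, mu2 = feats_real.mean(0), feats_synth.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_synth, rowvar=False)
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) via the eigenvalues of S1 @ S2, which are real
    # and non-negative for PSD covariances (clip numerical noise).
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0, None)).sum()
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2 * tr_sqrt)
```

With only 30 synthetic images the covariance estimate is noisy, so any reported score for this dataset would itself need a caveat about sample size.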
Minor comments (1)
- [Abstract] The dataset URL is provided, but the abstract could briefly note the number of images per condition and the exact LLM models used, to improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the scope and documentation of GazeVaLM. The comments correctly identify areas where the manuscript can be strengthened for clarity and to avoid overgeneralization. We address each major comment below, indicating the revisions we will incorporate.
Point-by-point responses
- Referee: [Abstract] The claim that GazeVaLM enables general study of clinical perception, authenticity assessment, and human-AI differences in AI-generated X-rays rests on only 30 diffusion-generated images from a single pipeline. No quantitative evidence (e.g., FID scores, perceptual metrics, or comparisons to other generative backbones) is provided that these images span the artifact distribution of current medical generative models, so gaze patterns, diagnostic gaps, and authenticity judgments remain tied to this narrow sample rather than supporting the stated general utility of the benchmark.
Authors: We agree that the synthetic images originate from a single diffusion pipeline and that the manuscript provides no FID scores or cross-model comparisons to demonstrate coverage of the full range of current generative artifacts. The observed gaze patterns and human-AI differences are therefore specific to this generation method rather than broadly representative. In the revision we will (1) add FID and perceptual similarity metrics for the 30 synthetic images, (2) include a dedicated Limitations subsection that explicitly states the benchmark is tied to one generative backbone, and (3) revise the abstract and introduction to frame the contribution as a reproducible resource for studying diffusion-generated chest X-rays rather than claiming general utility across all AI-generated medical images. These changes preserve the value of the released multi-observer gaze data while accurately reflecting its scope. Revision: yes.
- Referee: [Abstract] The description of dataset size, participant count, and image split is given, but no details appear on image generation parameters, radiologist recruitment criteria, gaze calibration, or statistical analysis methods. These omissions are load-bearing for evaluating whether the 960 recordings constitute a reliable, reproducible benchmark.
Authors: The full manuscript contains these details in the Methods section (image generation parameters and prompts in Section 3.1, radiologist recruitment and inclusion criteria in Section 3.2, the eye-tracker calibration protocol in Section 3.3, and statistical analysis procedures in Section 4). However, we acknowledge that the abstract and early sections do not surface them sufficiently for a benchmark paper. We will expand the abstract with concise statements of the key parameters, add a summary table of dataset-construction details, and cross-reference the key methodological parameters from the abstract to the relevant Methods subsections. This will make the reproducibility information immediately accessible without altering the existing content. Revision: yes.
Circularity Check
Dataset and benchmark release with no derivations, fitted parameters, or self-referential predictions.
Full rationale
The manuscript introduces a new eye-tracking dataset (GazeVaLM) comprising 960 recordings from 16 radiologists on 60 chest X-rays (30 real, 30 diffusion-generated) under two protocols, plus extensions to 6 LLMs. It reports raw data, fixation maps, labels, and basic analyses of gaze agreement and accuracy differences. No equations, parameter fitting, uniqueness theorems, or predictions are defined in terms of the authors' own prior choices or fitted inputs. All claims reduce to direct measurement and release of new observations rather than any closed loop of self-definition or renamed fits. This is a standard honest non-finding for a benchmark paper.