GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification
Pith reviewed 2026-05-15 12:00 UTC · model grok-4.3
The pith
A new public tri-modal dataset and hierarchical attentive masked modeling framework integrate fundus, OCT, and visual-field data to classify glaucoma across four disease stages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modalrepresentation
What carries the argument
Hierarchical attentive masked modeling (HAMM), which applies hierarchical attentive encoders to cross-modal representation learning while restricting decoders to lightweight reconstruction tasks.
If this is right
- The dataset supplies aligned examples across structural and functional modalities that can be used to train or benchmark any multimodal glaucoma classifier.
- HAMM's encoder-focused design reduces decoder complexity while preserving cross-modal attention, lowering compute cost for clinical deployment.
- Four-stage labeling supports both binary detection and finer progression monitoring in the same framework.
- Public release removes the data-access barrier that has limited prior multimodal glaucoma studies.
Where Pith is reading between the lines
- If the dataset and method prove robust, similar tri-modal resources could be assembled for other retinal diseases where structural and functional tests are already collected separately.
- The encoder-centric masked modeling pattern may transfer to other medical imaging domains that combine scans from different physical principles.
- Routine clinical workflows could eventually feed the three modalities into one model at acquisition time, shortening diagnostic turnaround.
Load-bearing premise
The three imaging modalities supply complementary signals that the HAMM architecture can fuse more effectively than single-modality or simpler fusion baselines, and that the four-stage annotations are accurate enough to train reliable classifiers.
What would settle it
A head-to-head test on the released GLEAM dataset in which HAMM shows no statistically significant accuracy gain over either single-modality models or a baseline that simply concatenates features from the three modalities.
read the original abstract
We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GLEAM, the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy (SLO) fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated across four disease stages. It further proposes the Hierarchical Attentive Masked Modeling (HAMM) framework, which employs hierarchical attentive encoders paired with light decoders to perform cross-modal representation learning focused on the encoder for glaucoma classification.
Significance. If the dataset release and HAMM framework are validated with quantitative results, this work would provide a valuable public resource for multimodal glaucoma research by exploiting complementary information across imaging modalities. The emphasis on lightweight decoders and encoder-focused learning offers a potentially efficient alternative to standard multimodal fusion approaches, which could facilitate broader adoption in clinical diagnostic pipelines.
major comments (2)
- [§4] §4 (Experimental Setup): No ablation studies are presented that isolate the contribution of each modality (SLO, OCT, VF) or compare HAMM against standard multimodal baselines such as early/late fusion or cross-attention transformers. Without these, the central claim that the tri-modal dataset enables effective complementary information exploitation remains unverified and load-bearing for the paper's contribution.
- [§3.1] §3.1 (Dataset Annotation): The description of the four-stage disease annotation process lacks details on annotation protocol, number of experts, or inter-rater reliability metrics. This directly affects the trustworthiness of the labels used to train and evaluate the HAMM classifier.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit quantitative performance metrics (e.g., accuracy, AUC) even in summary form to allow readers to gauge the framework's effectiveness without reading the full experiments section.
- [§3.2] Notation for the hierarchical attentive encoders (e.g., definitions of attention heads per modality) is introduced without a clear equation or diagram reference, making the architecture description harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below and have incorporated revisions to strengthen the validation of both the GLEAM dataset and the HAMM framework.
read point-by-point responses
-
Referee: [§4] §4 (Experimental Setup): No ablation studies are presented that isolate the contribution of each modality (SLO, OCT, VF) or compare HAMM against standard multimodal baselines such as early/late fusion or cross-attention transformers. Without these, the central claim that the tri-modal dataset enables effective complementary information exploitation remains unverified and load-bearing for the paper's contribution.
Authors: We agree that explicit ablations are necessary to substantiate the claim of complementary information exploitation. In the revised manuscript, we have added a dedicated ablation subsection in §4 that reports performance for each single modality (SLO, OCT, VF), all pairwise combinations, and the full tri-modal setting. We further benchmark HAMM against early fusion, late fusion, and a cross-attention transformer baseline using identical encoder backbones and training protocols. The new results confirm that tri-modal HAMM outperforms both unimodal and standard fusion approaches, directly verifying the value of the GLEAM dataset. revision: yes
-
Referee: [§3.1] §3.1 (Dataset Annotation): The description of the four-stage disease annotation process lacks details on annotation protocol, number of experts, or inter-rater reliability metrics. This directly affects the trustworthiness of the labels used to train and evaluate the HAMM classifier.
Authors: We thank the referee for highlighting this omission. Section 3.1 has been expanded to describe the annotation protocol in detail: three board-certified glaucoma specialists independently labeled each case according to a standardized four-stage rubric derived from clinical guidelines. We now report the number of experts, the adjudication process for disagreements, and inter-rater reliability metrics (Fleiss’ kappa = 0.78, indicating substantial agreement). These additions establish the reliability of the GLEAM labels. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces a tri-modal glaucoma dataset (GLEAM) and proposes the HAMM framework using hierarchical attentive encoders and light decoders. No equations, derivations, fitted parameters, or predictions appear in the abstract or described full text. The claims rest on dataset release and standard multimodal masked modeling logic without self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder. The attention module, named multimodal-channel graph attention (MCGA)...
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
HAMM is built entirely on convolutional neural networks (CNNs)...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
T. V oset al., “Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: a systematic analysis for the global burden of disease study 2015,”The Lancet, vol. 388, no. 10053, p. 1545 – 1602, 2016
work page 1990
-
[2]
Y .-C. Thamet al., “Global prevalence of glaucoma and projections of glaucoma burden through 2040: A systematic review and meta-analysis,” Ophthalmology, vol. 121, no. 11, pp. 2081–2090, 2014
work page 2040
-
[3]
S. Choeet al., “Natural history of optic disc with physiologic large cup: Incidence, predictors of glaucoma conversion after minimum 10-year follow-up,”American Journal of Ophthalmology, vol. 254, pp. 150–160, 2023
work page 2023
-
[4]
M. Yuet al., “Risk of visual field progression in glaucoma patients with progressive retinal nerve fiber layer thinning: A 5-year prospective study,” Ophthalmology, vol. 123, no. 6, pp. 1201–1210, 2016
work page 2016
-
[5]
S. Asanoet al., “Early detection of glaucomatous visual field progression using pointwise linear regression with binomial test in the central 10 degrees,”American Journal of Ophthalmology, vol. 199, pp. 140–149, 2019
work page 2019
-
[6]
Staging functional damage in glaucoma: Review of different classification methods,
P. Brusini and C. A. Johnson, “Staging functional damage in glaucoma: Review of different classification methods,”Survey of Ophthalmology, vol. 52, no. 2, pp. 156–179, 2007
work page 2007
-
[7]
Origa-light: An online retinal fundus image database for glaucoma analysis and research,
Z. Zhanget al., “Origa-light: An online retinal fundus image database for glaucoma analysis and research,” in2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, 2010, pp. 3065–3068
work page 2010
-
[8]
Drishti-gs: Retinal image dataset for optic nerve head(onh) segmentation,
J. Sivaswamyet al., “Drishti-gs: Retinal image dataset for optic nerve head(onh) segmentation,” in2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI), 2014, pp. 53–56
work page 2014
-
[9]
J. I. Orlandoet al., “Refuge challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs,” Medical Image Analysis, vol. 59, p. 101570, 2020
work page 2020
-
[10]
J. M. Ahnet al., “A deep learning model for the detection of both advanced and early glaucoma using fundus photography,”PLoS ONE, vol. 13, 2018
work page 2018
-
[11]
Cnns for automatic glaucoma assessment using fundus images: an extensive validation,
A. Diaz-Pintoet al., “Cnns for automatic glaucoma assessment using fundus images: an extensive validation,”BioMedical Engineering OnLine, vol. 18, 2019
work page 2019
-
[12]
Y . Luoet al., “Harvard glaucoma detection and progression: A multimodal multitask dataset and generalization-reinforced semi-supervised learning,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 20 414–20 425
work page 2023
-
[13]
Disc-aware ensemble network for glaucoma screening from fundus image,
H. Fuet al., “Disc-aware ensemble network for glaucoma screening from fundus image,”IEEE Transactions on Medical Imaging, vol. 37, no. 11, pp. 2493–2501, 2018
work page 2018
-
[14]
Glim-net: Chronic glaucoma forecast transformer for irregularly sampled sequential fundus images,
X. Huet al., “Glim-net: Chronic glaucoma forecast transformer for irregularly sampled sequential fundus images,”IEEE Transactions on Medical Imaging, vol. 42, no. 6, pp. 1875–1884, 2023
work page 2023
-
[15]
Cct-net: Category-invariant cross-domain transfer for medical single-to-multiple disease diagnosis,
Y . Zhou, L. Huang, T. Zhou, and L. Shao, “Cct-net: Category-invariant cross-domain transfer for medical single-to-multiple disease diagnosis,” in2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8240–8250
work page 2021
-
[16]
A workflow for computer-aided diagnosis of glau- coma,
H. Wanget al., “A workflow for computer-aided diagnosis of glau- coma,” in2022 IEEE International Symposium on Biomedical Imaging Challenges (ISBIC), 2022, pp. 1–4
work page 2022
-
[17]
Glaucoformer: Dual-domain global transformer network for generalized glaucoma stage classification,
D. Das, D. R. Nayak, and R. B. Pachori, “Glaucoformer: Dual-domain global transformer network for generalized glaucoma stage classification,” IEEE Journal of Biomedical and Health Informatics, vol. 29, no. 11, pp. 8450–8459, 2025
work page 2025
-
[18]
Fja-net: A fuzzy joint attention guided network for classification of glaucoma stages,
D. Das and D. R. Nayak, “Fja-net: A fuzzy joint attention guided network for classification of glaucoma stages,”IEEE Transactions on Fuzzy Systems, vol. 32, no. 10, pp. 5438–5448, 2024
work page 2024
-
[19]
Artifacts in spectral-domain optical coherence tomography measurements in glaucoma,
S. Asrani, L. Essaid, B. D. Alder, and C. Santiago-Turla, “Artifacts in spectral-domain optical coherence tomography measurements in glaucoma,”JAMA Ophthalmology, vol. 132, no. 4, pp. 396–402, 04 2014
work page 2014
-
[20]
T. Heikka and N. M. Jansonius, “Influence of signal-to-noise ratio, glau- coma stage and segmentation algorithm on oct usability for quantifying layer thicknesses in the peripapillary retina,”Acta Ophthalmologica, vol. 101, no. 3, pp. 251–260, 2023
work page 2023
-
[21]
‘structure–function relationship’ in glaucoma: past thinking and current concepts,
R. Malik, W. H. Swanson, and D. F. Garway-Heath, “‘structure–function relationship’ in glaucoma: past thinking and current concepts,”Clinical & Experimental Ophthalmology, vol. 40, no. 4, pp. 369–380, 2012
work page 2012
-
[22]
C. Bowdet al., “Bayesian machine learning classifiers for combining structural and functional measurements to classify healthy and glauco- matous eyes.”Investigative ophthalmology & visual science, vol. 49 3, pp. 945–53, 2008
work page 2008
-
[23]
A. Antonet al., “Diagnostic accuracy and detection rate of glaucoma screening with optic disk photos, optical coherence tomography images, and telemedicine,”Journal of Clinical Medicine, vol. 11, no. 1, 2022
work page 2022
-
[24]
Gamma challenge: Glaucoma grading from multi-modality images,
J. Wu, H. Fang, F. Liet al., “Gamma challenge: Glaucoma grading from multi-modality images,”Medical Image Analysis, vol. 90, p. 102938, 2023
work page 2023
-
[25]
Elf: An end-to-end local and global multimodal fusion framework for glaucoma grading,
W. Li and C.-M. Pun, “Elf: An end-to-end local and global multimodal fusion framework for glaucoma grading,” in2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2023, pp. 4081– 4085
work page 2023
-
[26]
Mstnet: method for glaucoma grading based on multimodal feature fusion of spatial relations,
Z. Wanget al., “Mstnet: method for glaucoma grading based on multimodal feature fusion of spatial relations,”Physics in Medicine & Biology, vol. 68, no. 24, p. 245002, dec 2023
work page 2023
-
[27]
Z. Cai, L. Lin, H. He, and X. Tang, “Corolla: An efficient multi-modality fusion framework with supervised contrastive learning for glaucoma grading,” in2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), 2022, pp. 1–4
work page 2022
-
[28]
Geometric correspondence-based multimodal learning for ophthalmic image analysis,
Y . Wanget al., “Geometric correspondence-based multimodal learning for ophthalmic image analysis,”IEEE Transactions on Medical Imaging, vol. 43, no. 5, pp. 1945–1957, 2024
work page 1945
-
[29]
Z. Yanget al., “Etscl: An evidence theory-based supervised contrastive learning framework for multi-modal glaucoma grading,” inOphthalmic Medical Image Analysis (OMIA) at MICCAI, 2024, pp. 11–21
work page 2024
-
[30]
Data on oct and fundus images for the detection of glaucoma,
H. Rajaet al., “Data on oct and fundus images for the detection of glaucoma,”Data in Brief, vol. 29, p. 105342, 2020
work page 2020
-
[31]
X. Huanget al., “Grape: A multi-modal dataset of longitudinal follow-up visual field and fundus images for glaucoma management,”Scientific Data, vol. 10, no. 1, p. 520, 2023
work page 2023
-
[32]
Y . Luoet al., “Harvard glaucoma fairness: A retinal nerve disease dataset for fairness learning and fair identity normalization,”IEEE Transactions on Medical Imaging, vol. 43, no. 7, pp. 2623–2633, 2024
work page 2024
-
[33]
Estimating the rate of retinal ganglion cell loss in glaucoma,
F. A. Medeiroset al., “Estimating the rate of retinal ganglion cell loss in glaucoma,”American Journal of Ophthalmology, vol. 154, no. 5, pp. 814–824.e1, 2012
work page 2012
-
[34]
Bayer,Combining Structure and Function in Glaucoma
A. Bayer,Combining Structure and Function in Glaucoma. Cham: Springer International Publishing, 2018, pp. 329–343
work page 2018
-
[35]
Z. Chenet al., “Combination of enhanced depth imaging optical coherence tomography and fundus images for glaucoma screening,” Journal of medical systems, vol. 43, no. 6, p. 163, 2019
work page 2019
-
[36]
Combining optical coherence tomography and fundus photography to improve glaucoma screening,
T. Watanabeet al., “Combining optical coherence tomography and fundus photography to improve glaucoma screening,”Diagnostics, vol. 12, no. 5, 2022
work page 2022
-
[37]
A. Kamalipouret al., “Combining optical coherence tomography and optical coherence tomography angiography longitudinal data for the detection of visual field progression in glaucoma,”American Journal of Ophthalmology, vol. 246, pp. 141–154, 2023
work page 2023
-
[38]
E. E. Hwanget al., “Utilization of image-based deep learning in multimodal glaucoma detection neural network from a primary patient cohort,”Ophthalmology Science, vol. 5, no. 3, p. 100703, 2025
work page 2025
-
[39]
Y . Li, Y . Han, Z. Li, Y . Zhong, and Z. Guo, “A transfer learning-based multimodal neural network combining metadata and multiple medical images for glaucoma type diagnosis,”Scientific Reports, vol. 13, 2023
work page 2023
-
[40]
M.-I. Georgescuet al., “Multimodal multi-head convolutional attention with various kernel sizes for medical image super-resolution,” in2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2194–2204
work page 2023
-
[41]
Multimodal fusion learning with dual attention for medical imaging,
J. Dharet al., “Multimodal fusion learning with dual attention for medical imaging,” in2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 4362–4371. 14 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. XX, NO. XX, XXXX 2020
work page 2025
-
[42]
Supervised contrastive learning,
P. Khoslaet al., “Supervised contrastive learning,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 18 661–18 673
work page 2020
-
[43]
Masked autoencoders are scalable vision learners,
K. Heet al., “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 16 000–16 009
work page 2022
-
[44]
A multimodal visual–language foundation model for computational ophthalmology,
D. Shiet al., “A multimodal visual–language foundation model for computational ophthalmology,”npj Digital Medicine, 2025
work page 2025
-
[45]
Multimae: Multi- modal multi-task masked autoencoders,
R. Bachmann, D. Mizrahi, A. Atanov, and A. Zamir, “Multimae: Multi- modal multi-task masked autoencoders,” inEuropean Conference on Computer Vision (ECCV), 2022, pp. 348–367
work page 2022
-
[46]
Urfound: Towards universal retinal foundation models via knowledge-guided masked modeling,
K. Yuet al., “Urfound: Towards universal retinal foundation models via knowledge-guided masked modeling,” inMedical Image Computing and Computer Assisted Intervention – MICCAI 2024, 2024, pp. 753–762
work page 2024
-
[47]
Designing bert for convolutional networks: Sparse and hierarchical masked modeling,
K. Tianet al., “Designing bert for convolutional networks: Sparse and hierarchical masked modeling,” inIn Proceedings of the International Conference on Learning Representations (ICLR), 2023
work page 2023
-
[48]
Association between combined structure function index and glaucoma severity,
S. Ogawaet al., “Association between combined structure function index and glaucoma severity,”Journal of Ophthalmology, vol. 2019, no. 1, p. 9414675, 2019
work page 2019
-
[49]
A large-scale database and a cnn model for attention-based glaucoma detection,
L. Liet al., “A large-scale database and a cnn model for attention-based glaucoma detection,”IEEE Transactions on Medical Imaging, vol. 39, no. 2, pp. 413–424, 2020
work page 2020
-
[50]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778
work page 2016
-
[51]
Xception: Deep learning with depthwise separable convo- lutions,
F. Chollet, “Xception: Deep learning with depthwise separable convo- lutions,” in2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800–1807
work page 2017
-
[52]
Multimodal intelligence: Representation learning, information fusion, and applications,
C. Zhang, Z. Yang, X. He, and L. Deng, “Multimodal intelligence: Representation learning, information fusion, and applications,”IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, pp. 478–493, 2020
work page 2020
-
[53]
On calibration of modern neural networks,
C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” inProceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, p. 1321–1330
work page 2017
-
[54]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,” inIn Proceedings of the International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[55]
Z. Liuet al., “A convnet for the 2020s,”Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
work page 2022
-
[56]
Grad-cam: Visual explanations from deep networks via gradient-based localization,
R. R. Selvarajuet al., “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626
work page 2017
-
[57]
Cbam: Convolutional block attention module,
S. Woo, J. Park, J.-Y . Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” inProceedings of the European Conference on Computer Vision (ECCV), September 2018
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.