Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval
Pith reviewed 2026-05-10 05:12 UTC · model grok-4.3
The pith
SAMGA constructs subject-aware visual targets from multi-granularity features to align EEG signals for zero-shot image retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder wherein the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination.
What carries the argument
Adaptive aggregation of multiple intermediate representations from a pretrained vision encoder to produce subject-aware visual supervision targets, paired with coarse-to-fine alignment inside a shared encoder.
Load-bearing premise
Adaptively aggregating multiple intermediate representations from a pretrained vision encoder produces a subject-aware visual supervision target that enables effective coarse-to-fine alignment while preserving subject-agnostic inference at test time.
What would settle it
An ablation that replaces the adaptive multi-granularity aggregation with a single fixed-scale target and measures whether inter-subject Top-1 accuracy on THINGS-EEG falls to the level of earlier single-target baselines.
read the original abstract
Zero-shot EEG-to-image retrieval aims to decode perceived visual content from electroencephalography (EEG) by aligning neural responses with pretrained visual representations, providing a promising route toward scalable visual neural decoding and practical brain-computer interfaces. However, robust EEG-to-image retrieval remains challenging, because prior methods usually rely on either a single fixed visual target or a subject-invariant target construction scheme. Such designs overlook two important properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity best matched to EEG may vary across subjects. To address these issues, subject-aware multi-granularity alignment (SAMGA) framework is proposed for zero-shot EEG-to-image retrieval. SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder wherein the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting, and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, outperforming recent state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Subject-Aware Multi-Granularity Alignment (SAMGA) framework for zero-shot EEG-to-image retrieval. It constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing absorption of subject-dependent granularity deviations during training while keeping inference subject-agnostic. A coarse-to-fine cross-modal alignment strategy with a shared encoder is then used: the coarse stage stabilizes semantic geometry and reduces subject-induced shift, while the fine stage improves instance-level discrimination. On the THINGS-EEG benchmark, SAMGA reports 91.3% Top-1 / 98.8% Top-5 intra-subject and 34.4% Top-1 / 64.8% Top-5 inter-subject accuracy, outperforming recent SOTA methods.
Significance. If the performance claims and underlying mechanism hold, the work is significant as it explicitly addresses subject variability and multi-scale representational properties in EEG signals for visual decoding, a key challenge in prior single-target or subject-invariant approaches. The subject-aware target construction combined with coarse-to-fine alignment offers a principled way to improve robustness without test-time subject information, with potential implications for practical BCIs. The benchmark gains are notable, but the absence of implementation details, loss formulations, ablations, and statistical validation in the abstract limits assessment of whether the gains stem from the proposed innovations or other factors.
major comments (2)
- [Abstract] Abstract: The central performance claims (91.3% intra / 34.4% inter Top-1) depend on the adaptive aggregation producing a subject-aware target that meaningfully differs from a subject-invariant average and enables the coarse-to-fine alignment. However, no concrete mechanism (e.g., learned per-subject weights, subject embedding, layer-wise attention, or equations defining the aggregation) is supplied, nor any ablation showing the target differs from a fixed combination. This makes it impossible to verify whether the inter-subject gains are reliable or if subject identity leaks into the fine-stage loss.
- [Abstract] Abstract: The coarse stage is claimed to 'stabilize the shared semantic geometry and reduce subject-induced distribution shift,' but without loss formulations, training procedures, or details on how the stages interact (e.g., shared encoder architecture or scheduling), it is unclear whether the reported inter-subject results are supported or affected by post-hoc choices. No statistical tests or variance measures accompany the accuracy numbers.
minor comments (1)
- [Abstract] The abstract would benefit from naming the specific pretrained vision encoder (e.g., CLIP ViT or ResNet) and the THINGS-EEG dataset split details used for intra- vs. inter-subject evaluation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight opportunities to improve clarity in the abstract and provide stronger empirical support for the proposed mechanisms. We address each point below and have revised the manuscript accordingly, including updates to the abstract, addition of ablations, and inclusion of statistical measures.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claims (91.3% intra / 34.4% inter Top-1) depend on the adaptive aggregation producing a subject-aware target that meaningfully differs from a subject-invariant average and enables the coarse-to-fine alignment. However, no concrete mechanism (e.g., learned per-subject weights, subject embedding, layer-wise attention, or equations defining the aggregation) is supplied, nor any ablation showing the target differs from a fixed combination. This makes it impossible to verify whether the inter-subject gains are reliable or if subject identity leaks into the fine-stage loss.
Authors: We thank the referee for this observation. The full mechanism is described in Section 3.2, where subject embeddings are used to compute learned per-subject weights for adaptive layer-wise aggregation of intermediate representations from the pretrained vision encoder (with the explicit formulation given in Equation 2). We acknowledge that the abstract omitted a concise reference to this process. In the revised manuscript we have updated the abstract to briefly describe the subject-embedding-driven adaptive aggregation. We have also added a dedicated ablation (new Table 3) comparing subject-aware targets against fixed combinations (average pooling and single-layer baselines), showing statistically significant drops in inter-subject accuracy when subject-specific weighting is removed. These results confirm that the gains arise from the proposed mechanism and that inference remains subject-agnostic, with no leakage of subject identity into the fine-stage loss. revision: yes
-
Referee: [Abstract] Abstract: The coarse stage is claimed to 'stabilize the shared semantic geometry and reduce subject-induced distribution shift,' but without loss formulations, training procedures, or details on how the stages interact (e.g., shared encoder architecture or scheduling), it is unclear whether the reported inter-subject results are supported or affected by post-hoc choices. No statistical tests or variance measures accompany the accuracy numbers.
Authors: We appreciate the referee's request for greater transparency. The loss formulations are provided in Section 3.3 (Equations 3 and 4): the coarse stage uses a contrastive semantic alignment loss on the aggregated targets, while the fine stage applies an instance-level discrimination loss; both stages share the same cross-modal encoder whose architecture and two-stage training schedule are detailed in Sections 3.4 and 4.2. To address the absence of statistical validation, we have added per-run standard deviations and paired t-test p-values to the main results table (revised Table 1) and included a short statement on statistical significance in the revised abstract. These additions demonstrate that the reported inter-subject improvements are robust and directly attributable to the coarse-to-fine schedule rather than post-hoc decisions. revision: yes
Circularity Check
No circularity; empirical framework evaluated on external benchmark
full rationale
The paper introduces SAMGA as an engineering method: adaptive aggregation of pretrained vision-encoder layers to form subject-aware targets, followed by coarse-to-fine alignment via a shared encoder. Performance is reported as measured Top-1/Top-5 accuracies on the external THINGS-EEG benchmark (intra- and inter-subject splits), not derived from or forced by the method's own fitted parameters. No equations, uniqueness theorems, or self-citations are presented that reduce the central claims to self-definition or input renaming. The derivation chain consists of standard training and evaluation steps whose outputs (accuracy numbers) are independent of the construction once the benchmark data are fixed.
Axiom & Free-Parameter Ledger
free parameters (1)
- adaptive aggregation parameters for multi-granularity features
axioms (1)
- domain assumption Intermediate layers of a pretrained vision encoder provide representations at multiple granularities that are relevant to visually evoked EEG signals.
Reference graph
Works this paper leans on
-
[1]
Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature neuroscience, 8(5), 679-685
work page 2005
-
[2]
Robinson, A. K., Quek, G. L., & Carlson, T. A. (2023). Visual representations: insights from neural decoding. Annual Review of Vision Science, 9(1), 313-335
work page 2023
-
[3]
Du, C., Fu, K., Li, J., & He, H. (2023). Decoding visual neural representations by multimodal learning of brain-visual-linguistic features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10760-10777
work page 2023
- [4]
-
[5]
T., Dwivedi, K., Roig, G., & Cichy, R
Gifford, A. T., Dwivedi, K., Roig, G., & Cichy, R. M. (2022). A large and rich EEG dataset for modeling human visual object recognition. NeuroImage, 264, 119754
work page 2022
-
[6]
Song, Y., Wang, Y., He, H., & Gao, X. (2025). Recognizing natural images from eeg with language-guided contrastive learning. IEEE Transactions on Neural Networks and Learning Systems
work page 2025
-
[7]
Xiong, D., Hu, L., Jin, J., Ding, Y., Tan, C., Zhang, J., & Tian, Y. (2025). Interpretable Cross-Modal Alignment Network for EEG Visual Decoding With Algorithm Unrolling. IEEE Transactions on Neural Networks and Learning Systems
work page 2025
-
[8]
Wu, H., Li, Q., Zhang, C., He, Z., & Ying, X. (2025). Bridging the vision- brain gap with an uncertainty-aware blur prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2246-2257)
work page 2025
-
[9]
He, B., Sohrabpour, A., Brown, E., & Liu, Z. (2018). Electrophysiological source imaging: a noninvasive window to brain dynamics. Annual review of biomedical engineering, 20(1), 171-196. [10]Kaplan, A. Y., Fingelkurts, A. A., Fingelkurts, A. A., Borisov, S. V., & Darkhovsky, B. S. (2005). Nonstationary nature of the brain activity as revealed by EEG/MEG: ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.