SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Hongjie Yan; Nizhuan Wang; Shengyu Gong; Wai Ting Siok; Weiming Zeng; Yueyang Li; Zijian Kang

arxiv: 2606.16615 · v2 · pith:ZX4X7C4Anew · submitted 2026-06-15 · 💻 cs.CV

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Pith reviewed 2026-06-27 03:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords EEG visual decoding · multimodal contrastive learning · zero-shot learning · brain-computer interface · natural images · semantic alignment · subject variability · THINGS-EEG

The pith

Structured alignment supervision via semantic attention and pseudo-feature coding overcomes geometric-only limitations in EEG decoding of natural images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multimodal contrastive models align EEG signals to images only by minimizing geometric distances, which ignores semantic content in the visuals and variability across subjects, causing many incorrect zero-shot matches on natural scenes. The paper introduces SUP-MCRL to supply more structured supervision through three linked components that learn semantic spatial attention, adapt EEG features across subjects with multi-scale and attention operations, and maintain a running pool of pseudo-features to stabilize training. Experiments on the THINGS-EEG dataset report 66.0 percent top-1 and 91.9 percent top-5 accuracy for the same subject, plus 24.0 percent top-1 and 52.9 percent top-5 when leaving one subject out, both well above prior methods. A reader would care because the result suggests a concrete route to making non-invasive brain-computer interfaces work with everyday visual stimuli rather than only controlled lab pictures.
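For context, the geometric-only baseline the paper argues against is usually a CLIP-style objective that pulls matched EEG and image embeddings together by cosine similarity. A minimal sketch, assuming a symmetric InfoNCE loss with a fixed temperature; the function names and the temperature value are illustrative and are not taken from the paper:

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(eeg, img, temperature=0.07):
    """Assumed baseline loss: eeg[i] and img[i] form the only positive pair."""
    n = len(eeg)
    logits = [[cosine(e, v) / temperature for v in img] for e in eeg]
    # EEG -> image direction: each row should peak on its diagonal entry.
    eeg_to_img = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Image -> EEG direction: same idea on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    img_to_eeg = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (eeg_to_img + img_to_eeg)
```

Nothing in this objective looks at what the image depicts or at which subject produced the EEG trial, which is the gap the paper sets out to close.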

Core claim

The paper claims that SUP-MCRL, a unified multimodal contrastive framework, produces subject-aware representations that achieve markedly higher zero-shot accuracy on natural-image EEG decoding than models limited to geometric alignment. The framework is built around three components: a Semantic-entity Aware Visual Encoder that extracts semantic content via learned spatial attention, a Unified EEG Enhancer that applies multi-scale atrous convolutions and inter-band attention for cross-subject robustness, and a Prototype-based Progressive Augmenter that maintains an EMA-updated pseudo-feature pool.
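The summary does not give the update rule for the pseudo-feature pool. A sketch of what an EMA-updated pool could look like, assuming one unit-norm vector per class and a momentum hyperparameter; the class name, the per-class keying, and the momentum value are hypothetical:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class PseudoFeaturePool:
    """Hypothetical pool: one running pseudo-feature per class, updated by EMA."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.pool = {}

    def update(self, class_id, feature):
        feature = l2_normalize(feature)
        if class_id not in self.pool:
            # First observation of this class seeds the pool entry.
            self.pool[class_id] = feature
        else:
            m = self.momentum
            old = self.pool[class_id]
            # Standard EMA: keep a fraction m of the old entry, blend in the new one.
            mixed = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]
            self.pool[class_id] = l2_normalize(mixed)
        return self.pool[class_id]
```

A high momentum makes each entry drift slowly, which is the usual reason an EMA target is credited with stabilizing training; whether the paper's pool works this way is not stated here.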

What carries the argument

The three collaborative mechanisms SAVE, UEE, and PPA inside SUP-MCRL that together supply structured alignment supervision beyond pure geometric distance optimization.
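Of the three mechanisms, UEE is described as relying on multi-scale atrous (dilated) convolutions. The idea can be illustrated on a single EEG channel: the same kernel is applied with different gaps between its taps, so each branch covers a different time span. This is a generic sketch of dilated convolution, not the paper's layer; the function names and dilation rates are assumptions:

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D cross-correlation with `dilation` samples between kernel taps."""
    span = (len(kernel) - 1) * dilation  # receptive field is span + 1 samples
    return [
        sum(w * signal[t + i * dilation] for i, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def multi_scale_features(signal, kernel, dilations=(1, 2, 4)):
    """Apply one kernel at several dilation rates (rates are illustrative)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}
```

With a three-tap kernel, dilations of 1, 2, and 4 cover 3, 5, and 9 samples respectively without adding any weights.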

If this is right

The framework reaches 66.0 percent top-1 and 91.9 percent top-5 intra-subject accuracy on THINGS-EEG natural images.
Leave-one-subject-out performance reaches 24.0 percent top-1 and 52.9 percent top-5, showing better cross-subject generalization.
Accounting for semantic consistency and inter-subject variability reduces spurious zero-shot matches compared with geometric-only contrastive models.
The same structured supervision approach yields consistent gains across both intra-subject and leave-one-subject-out protocols.
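The top-1 and top-5 figures above are retrieval accuracies: each test EEG trial is scored against every candidate image, and a hit is counted when the correct image ranks among the k best. A minimal sketch, assuming a precomputed similarity matrix in which the correct image for trial i sits at column i (a layout chosen here for illustration):

```python
def top_k_accuracy(similarity, k):
    """Fraction of trials whose correct image (column i for row i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```

For example, `top_k_accuracy(sim, 1)` and `top_k_accuracy(sim, 5)` on the same matrix give the two numbers reported per protocol.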

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The pseudo-feature pool mechanism might stabilize contrastive training in other modalities where representation collapse is common.
If the gains hold, the same attention and enhancement blocks could be tested on EEG decoding of non-visual stimuli.
Online updating of the pseudo-feature pool suggests a possible path toward real-time adaptation in deployed brain-computer interfaces.
The emphasis on subject-aware robustness points to value in combining this method with few-shot personalization techniques.

Load-bearing premise

The accuracy gains come chiefly from the three proposed mechanisms rather than from other details of training, data handling, or evaluation on the THINGS-EEG dataset.

What would settle it

A controlled re-run of a baseline multimodal contrastive model on identical THINGS-EEG splits and protocol, without SAVE, UEE, or PPA, that reaches comparable intra-subject and LOSO top-1 and top-5 accuracies.
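One way to organize such a controlled comparison is to enumerate every on/off combination of the three modules and run each under identical splits and seeds. A small sketch; the configuration keys are hypothetical and do not correspond to flags in the authors' repository:

```python
from itertools import product

MODULES = ("SAVE", "UEE", "PPA")


def ablation_grid(modules=MODULES):
    """Return every on/off combination, from all-off baseline to the full model."""
    return [
        dict(zip(modules, flags))
        for flags in product((False, True), repeat=len(modules))
    ]
```

The first entry (all modules off) is the plain contrastive baseline and the last is the full framework; the six entries in between show which module accounts for which share of the gain.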

Figures

Figures reproduced from arXiv: 2606.16615 by Hongjie Yan, Nizhuan Wang, Shengyu Gong, Wai Ting Siok, Weiming Zeng, Yueyang Li, Zijian Kang.

**Figure 2.** Overall architecture of the proposed Semantic-Entity Aware Visual Encoder (SAVE). (A) The main encoder– …

**Figure 3.** Overall architecture of the proposed EEG encoder.

**Figure 4.** Main pipeline of the Prototype-based Progressive Augmenter (PPA).

**Figure 5.** The overall architecture of the proposed hierarchical codebook framework, consisting of …

**Figure 6.** Visual comparison of images enhanced by the SAVE module. Top: original images; bottom: SAVE-enhanced …

**Figure 8.** Distribution of temperature-scaled cosine similar…

**Figure 14.** PIES time-gate heatmap (channels vs. down…

**Figure 10.** Adaptive channel scale factors. The red dashed …

**Figure 11.** Channel-wise attention weights from the en…

**Figure 17.** Frobenius norms of band-specific convolutional …

**Figure 15.** Fusion weight allocation between the frequency …

**Figure 19.** Category-reordered similarity heatmap of the …

Original abstract

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on the THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The reported accuracy jumps on THINGS-EEG are the main takeaway, but the abstract gives no ablations or statistics, so the role of the three modules stays unproven.

The paper's central claim is that adding SAVE for semantic attention in the visual encoder, UEE for multi-scale atrous convolutions plus inter-band attention on EEG, and PPA for an EMA-updated pseudo-feature pool produces 66.0/91.9% intra-subject and 24.0/52.9% LOSO top-1/top-5 accuracy on THINGS-EEG zero-shot decoding, beating prior methods. That is the concrete result worth noting.

What is new is the specific framing of these three mechanisms as a way to add structured alignment supervision on top of standard contrastive losses, aimed at semantic consistency and cross-subject robustness for natural images. The authors link code, which lets others check the implementation directly.

The work is straightforward about the problem: conventional multimodal contrastive models ignore semantic content and inter-subject differences, leading to spurious matches. The proposed modules target those gaps in a unified way.

The soft spot is the missing evidence that the modules caused the gains. The abstract states the numbers and claims superiority but shows no error bars, no statistical tests, no ablation tables, and no training protocol details. Without those, the improvements could trace to optimizer settings, augmentation choices, split details, or other factors on this dataset rather than SAVE, UEE, or PPA. The LOSO result remains modest, which is realistic but also shows the task is still hard.

This paper is for researchers working on EEG-based visual decoding and multimodal contrastive methods who want to test a new combination on public data. A reader who needs to reproduce or extend the code could find it useful; someone looking for a settled advance would wait for the full methods and results.

I would send it for peer review. The task matters for BCI applications and the framework is specific enough that referees can evaluate the missing pieces.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SUP-MCRL, a multimodal contrastive representation learning framework for EEG-based visual decoding of natural images. It integrates three mechanisms—Semantic-entity Aware Visual Encoder (SAVE) for semantic spatial attention, Unified EEG Enhancer (UEE) using multi-scale atrous convolutions and inter-band attention, and Prototype-based Progressive Augmenter (PPA) with EMA-updated pseudo-feature pools—to address limitations in geometric alignment, semantic consistency, and inter-subject variability. Zero-shot experiments on THINGS-EEG report intra-subject accuracies of 66.0% Top-1 / 91.9% Top-5 and LOSO accuracies of 24.0% Top-1 / 52.9% Top-5, claiming these significantly surpass prior state-of-the-art methods due to the structured alignment supervision.

Significance. If the reported gains prove robustly attributable to SAVE, UEE, and PPA rather than implementation details, the work could meaningfully advance non-invasive BCI by improving cross-modal decoding for real-world stimuli. Public code release at the cited GitHub repository is a clear strength that aids reproducibility and allows independent verification of the central claims.

major comments (2)

[Abstract and Experiments] The central claim attributes the 66.0%/91.9% intra-subject and 24.0%/52.9% LOSO gains (and SOTA superiority) specifically to the three proposed mechanisms providing structured alignment supervision. No ablation studies isolating the contribution of SAVE, UEE, or PPA (e.g., variants with each component removed) are described, leaving open the possibility that gains arise from unstated factors such as optimizer choices, loss scaling, data augmentation, or subject-split details on THINGS-EEG.
[Results] The reported accuracy numbers are presented without error bars, standard deviations across multiple runs, or statistical significance tests against baselines. This undermines the load-bearing claim that the results 'significantly surpass state-of-the-art methods,' as it is impossible to assess whether observed differences exceed expected variability.

minor comments (1)

[Abstract] The abstract states that code is available, which is helpful; the full manuscript should expand the methods section with complete training protocol, hyperparameter values, exact data splits, and evaluation details to support reproduction.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major point below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract and Experiments] The central claim attributes the 66.0%/91.9% intra-subject and 24.0%/52.9% LOSO gains (and SOTA superiority) specifically to the three proposed mechanisms providing structured alignment supervision. No ablation studies isolating the contribution of SAVE, UEE, or PPA (e.g., variants with each component removed) are described, leaving open the possibility that gains arise from unstated factors such as optimizer choices, loss scaling, data augmentation, or subject-split details on THINGS-EEG.

Authors: We agree that ablation studies are required to isolate the contributions of SAVE, UEE, and PPA. The revised manuscript will add a dedicated ablation section with experiments that remove each component individually (and in combinations) while keeping all other implementation details fixed. Results will be reported on the same THINGS-EEG splits to quantify the performance drop attributable to each mechanism.
revision: yes

Referee: [Results] The reported accuracy numbers are presented without error bars, standard deviations across multiple runs, or statistical significance tests against baselines. This undermines the load-bearing claim that the results 'significantly surpass state-of-the-art methods,' as it is impossible to assess whether observed differences exceed expected variability.

Authors: We acknowledge that the absence of variability measures and statistical tests weakens the superiority claims. In the revision we will rerun all experiments across multiple random seeds, report mean accuracies with standard deviations and error bars, and include statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values) against the reproduced baselines.
revision: yes
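The kind of paired comparison promised here can be sketched without external libraries. The example below swaps in an exact paired sign-flip permutation test rather than the t-test or Wilcoxon test named in the rebuttal, because it needs only the standard library; the function name is hypothetical, and the inputs are assumed to be per-subject accuracies for the proposed model and a baseline on the same splits:

```python
from itertools import product


def paired_sign_flip_test(ours, baseline):
    """Two-sided exact p-value for the mean paired difference.

    Under the null hypothesis each per-subject difference is equally likely
    to have either sign, so all 2**n sign assignments are enumerated.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        # Small tolerance so floating-point ties count as "at least as extreme".
        if stat >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total
```

Full enumeration is cheap for a handful of subjects; with n paired scores the smallest attainable two-sided p-value is 2 / 2**n, so a small cohort caps how strong any significance claim can be.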

Circularity Check

0 steps flagged

No significant circularity: claims rest on experimental results

The paper presents its core claims as empirical outcomes: zero-shot accuracies of 66.0%/91.9% intra-subject and 24.0%/52.9% LOSO on the public THINGS-EEG dataset, attributed to the three proposed mechanisms (SAVE, UEE, PPA). No equations, fitted parameters, or self-citations are shown in the provided text that reduce these measured accuracies to inputs by construction. The derivation chain consists of architectural descriptions followed by benchmark evaluation, which is self-contained against external data and does not exhibit self-definitional, fitted-prediction, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; typical deep-learning models contain many hyperparameters whose values are not stated here.

pith-pipeline@v0.9.1-grok · 5797 in / 1187 out tokens · 50993 ms · 2026-06-27T03:04:48.692954+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 5 linked inside Pith

[1]

Yueyang Li, Weiming Zeng, Wenhao Dong, Di Han, Lei Chen, Hongyu Chen, Zijian Kang, Shengyu Gong, Hongjie Yan, Wai Ting Siok, and Nizhuan Wang. A tale of single-channel electroencephalography: Devices, datasets, signal processing, applications, and future directions.IEEE Transactions on Instrumentation and Measurement, 74:1–20, 2025

2025
[2]

Linguistics and human brain: A perspective of computational neuroscience.arXiv preprint arXiv:2602.08275, 2026

Fudong Zhang, Bo Chai, Yujie Wu, Wai Ting Siok, and Nizhuan Wang. Linguistics and human brain: A perspective of computational neuroscience.arXiv preprint arXiv:2602.08275, 2026

Pith/arXiv arXiv 2026
[3]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine ...

2021
[4]

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 17612–17625. Curran Associates, Inc., 2022

2022
[5]

Mitigate the gap: Improving cross-modal alignment in clip

Sedigheh Eslami and Gerard de Melo. Mitigate the gap: Improving cross-modal alignment in clip. InThe Thirteenth International Conference on Learning Representations, 2025. 19 SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

2025
[6]

Causality-inspired brain-visual contrastive learning for zero-shot visual decoding.Knowledge-Based Systems, 346:116182, 2026

Yi Xiao, Xuyi Qiao, Yu-Xuan Zhang, and Xianchuan Yu. Causality-inspired brain-visual contrastive learning for zero-shot visual decoding.Knowledge-Based Systems, 346:116182, 2026

2026
[7]

A large and rich eeg dataset for modeling human visual object recognition.NeuroImage, 264:119754, 2022

Alessandro T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. A large and rich eeg dataset for modeling human visual object recognition.NeuroImage, 264:119754, 2022

2022
[8]

Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding

Yueyang Li, Zijian Kang, Shengyu Gong, Wenhao Dong, Weiming Zeng, Hongjie Yan, Wai Ting Siok, and Nizhuan Wang. Neural-mcrl: Neural multimodal contrastive representation learning for eeg-based visual decoding. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2025

2025
[9]

An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

Pith/arXiv arXiv 2010
[10]

Changde Du, Kaicheng Fu, Jinpeng Li, and Huiguang He. Decoding visual neural representations by multimodal learning of brain-visual-linguistic features.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10760–10777, 2023

2023
[11]

Decoding natural images from eeg for object recognition

Yonghao Song, Bingchuan Liu, Xiang Li, Nanlin Shi, Yijun Wang, and Xiaorong Gao. Decoding natural images from eeg for object recognition. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, editors, International Conference on Learning Representations, volume 2024, pages 47648–47665, 2024

2024
[12]

Neuro-3d: Towards 3d visual decoding from eeg signals

Zhanqiang Guo, Jiamin Wu, Yonghao Song, Jiahui Bu, Weijian Mai, Qihao Zheng, Wanli Ouyang, and Chunfeng Song. Neuro-3d: Towards 3d visual decoding from eeg signals. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23870–23880, 2025

2025
[13]

Eeg-driven natural image reconstruc- tion with regional semantic awareness.Pattern Recognition, 172:112589, 2026

Xin Xiang, Wenhui Zhou, Haonan Zhu, Yunrui Li, Guojun Dai, and Lili Lin. Eeg-driven natural image reconstruc- tion with regional semantic awareness.Pattern Recognition, 172:112589, 2026

2026
[14]

Eeg2vision: A multimodal eeg-based framework for 2d visual reconstruction in cognitive neuroscience.arXiv preprint arXiv:2604.08063, 2026

Emanuele Balloni, Emanuele Frontoni, Chiara Matti, Marina Paolanti, Roberto Pierdicca, and Emiliano Santarnec- chi. Eeg2vision: A multimodal eeg-based framework for 2d visual reconstruction in cognitive neuroscience.arXiv preprint arXiv:2604.08063, 2026

Pith/arXiv arXiv 2026
[15]

Visual decoding and reconstruction via eeg embeddings with guided diffusion.arXiv preprint arXiv:2403.07721, 2024

Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, Haoyang Qin, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion.arXiv preprint arXiv:2403.07721, 2024

arXiv 2024
[16]

Neurobridge: Bio-inspired self-supervised eeg-to-image decoding via cognitive priors and bidirectional semantic alignment

Wenjiang Zhang, Sifeng Wang, Yuwei Su, Xinyu Li, Chen Zhang, and Suyu Zhong. Neurobridge: Bio-inspired self-supervised eeg-to-image decoding via cognitive priors and bidirectional semantic alignment. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18028–18036, 2026

2026
[17]

Damind: Zero-shot visual cross-domain alignment and representation for eeg decoding.IEEE Transactions on Image Processing, 35:3214–3227, 2026

Haodong Jing, Yongqiang Ma, Panqi Yang, Haoyu Li, Shuai Huang, Badong Chen, and Nanning Zheng. Damind: Zero-shot visual cross-domain alignment and representation for eeg decoding.IEEE Transactions on Image Processing, 35:3214–3227, 2026

2026
[18]

Mindsae: Advancing semantic perception for m/eeg-based visual decoding via unified multimodal alignment framework.Biomedical Signal Processing and Control, 123:110390, 2026

Chengjian Xu, Yonghao Song, Qiong Wang, and Qingqing Zheng. Mindsae: Advancing semantic perception for m/eeg-based visual decoding via unified multimodal alignment framework.Biomedical Signal Processing and Control, 123:110390, 2026

2026
[19]

Need: Cross-subject and cross-task generalization for video and image reconstruction from eeg signals

Shuai Huang, Huan Luo, Haodong Jing, Qixian Zhang, Litao Chang, Yating Feng, Xiao Lin, Chendong Qin, Han Chen, Shuwen Jia, Siyi Sun, and Yongxiong Wang. Need: Cross-subject and cross-task generalization for video and image reconstruction from eeg signals. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances ...

2025
[20]

Wei Li, Penglu Zhao, Cheng Xu, Yingting Hou, Wenhao Jiang, and Aiguo Song. Deep learning for eeg-based visual classification and reconstruction: Panorama, trends, challenges and opportunities.IEEE Transactions on Biomedical Engineering, 72(11):3374–3390, 2025

2025
[21]

Interpretable cross-modal alignment network for eeg visual decoding with algorithm unrolling.IEEE Transactions on Neural Networks and Learning Systems, 36(11):19894–19908, 2025

Daowen Xiong, Liangliang Hu, Jiahao Jin, Yikang Ding, Congming Tan, Jing Zhang, and Yin Tian. Interpretable cross-modal alignment network for eeg visual decoding with algorithm unrolling.IEEE Transactions on Neural Networks and Learning Systems, 36(11):19894–19908, 2025

2025
[22]

Stambridge: Spectral-temporal amplitude-aware mid-feature bridge for eeg visual decoding.arXiv preprint arXiv:2605.23137, 2026

Jiahe Meng, Weiming Zeng, Yueyang Li, Bo Chai, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, and Nizhuan Wang. Stambridge: Spectral-temporal amplitude-aware mid-feature bridge for eeg visual decoding.arXiv preprint arXiv:2605.23137, 2026

Pith/arXiv arXiv 2026
[23]

Seeeeg: Semantic-aware eeg-based multi-modal retrieval-augmented generation for high-fidelity visual brain decoding

Jun-Mo Kim, Woohyeok Choi, Sang-Jun Park, Keun-Soo Heo, Young-Han Son, Ji-Hye Oh, Dong-Hee Shin, and Tae-Eui Kam. Seeeeg: Semantic-aware eeg-based multi-modal retrieval-augmented generation for high-fidelity visual brain decoding. In2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 4883–4892, 2025. 20 SUP-MCRL: Subject-awa...

2025
[24]

Eeg conformer: Convolutional transformer for eeg decoding and visualization.IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710–719, 2023

Yonghao Song, Qingqing Zheng, Bingchuan Liu, and Xiaorong Gao. Eeg conformer: Convolutional transformer for eeg decoding and visualization.IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710–719, 2023

2023
[25]

Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017

2017
[26]

Eeg-itnet: An explainable inception temporal convolutional network for motor imagery classification.IEEE Access, 10:36672–36685, 2022

Abbas Salami, Javier Andreu-Perez, and Helge Gillmeister. Eeg-itnet: An explainable inception temporal convolutional network for motor imagery classification.IEEE Access, 10:36672–36685, 2022

2022
[27]

Xiong Xiong, Li Su, Jinjie Guo, Tianyuan Song, Ying Wang, Jinguo Huang, and Guixia Kang. Enhancing motor imagery decoding in brain–computer interfaces using riemann tangent space mapping and cross frequency coupling.Biomedical Signal Processing and Control, 99:106797, 2025

2025
[28]

Dtp-net: Learning to reconstruct eeg signals in time-frequency domain by multi-scale feature reuse.IEEE Journal of Biomedical and Health Informatics, 28(5):2662–2673, 2024

Yan Pei, Jiahui Xu, Qianhao Chen, Chenhao Wang, Feng Yu, Lisan Zhang, and Wei Luo. Dtp-net: Learning to reconstruct eeg signals in time-frequency domain by multi-scale feature reuse.IEEE Journal of Biomedical and Health Informatics, 28(5):2662–2673, 2024

2024
[29]

O’Connor, and Kevin McGuinness

Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2020

2020
[30]

Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks

Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. InWorkshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013

2013
[31]

Neurodecoder: A new framework for image decoding and reconstruction of eeg signals.IEEE Journal of Biomedical and Health Informatics, pages 1–14, 2026

Wenxuan Ma, Hongxin Zhang, Yexuan Li, and Mingyi Wei. Neurodecoder: A new framework for image decoding and reconstruction of eeg signals.IEEE Journal of Biomedical and Health Informatics, pages 1–14, 2026

2026
[32]

Representation learning with contrastive predictive coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

Pith/arXiv arXiv 2018
[33]

Lel: Lipschitz continuity constrained ensemble learning for efficient eeg-based intrasubject emotion recognition.IEEE Sensors Journal, 26(9):13446–13456, 2026

Shengyu Gong, Yueyang Li, Zijian Kang, Bo Chai, Weiming Zeng, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, and Nizhuan Wang. Lel: Lipschitz continuity constrained ensemble learning for efficient eeg-based intrasubject emotion recognition.IEEE Sensors Journal, 26(9):13446–13456, 2026

2026
[34]

Mb2c: Multimodal bidirectional cycle consistency for learning robust visual neural representations

Yayun Wei, Lei Cao, Hao Li, and Yilin Dong. Mb2c: Multimodal bidirectional cycle consistency for learning robust visual neural representations. InProceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 8992–9000, New York, NY , USA, 2024. Association for Computing Machinery

2024
[35]

Visual neural decoding via improved visual-eeg semantic consistency.arXiv preprint arXiv:2408.06788, 2024

Hongzhou Chen, Lianghua He, Yihang Liu, Longzhen Yang, Shaohua Shang, and MengChu Zhou. Visual neural decoding via improved visual-eeg semantic consistency.arXiv preprint arXiv:2408.06788, 2024

arXiv 2024
[36]

Bridging the vision-brain gap with an uncertainty-aware blur prior

Haitao Wu, Qing Li, Changqing Zhang, Zhen He, and Xiaomin Ying. Bridging the vision-brain gap with an uncertainty-aware blur prior. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2246–2257, 2025

2025
[37]

Spatial-functional awareness transformer-based graph archetype contrastive learning for decoding visual neural representations from eeg.arXiv preprint arXiv:2509.24761, 2025

Yueming Sun and Long Yang. Spatial-functional awareness transformer-based graph archetype contrastive learning for decoding visual neural representations from eeg.arXiv preprint arXiv:2509.24761, 2025. 21 SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding Appendix Supplementary Mater...

arXiv 2025

[1] [1]

Yueyang Li, Weiming Zeng, Wenhao Dong, Di Han, Lei Chen, Hongyu Chen, Zijian Kang, Shengyu Gong, Hongjie Yan, Wai Ting Siok, and Nizhuan Wang. A tale of single-channel electroencephalography: Devices, datasets, signal processing, applications, and future directions.IEEE Transactions on Instrumentation and Measurement, 74:1–20, 2025

2025

[2] [2]

Linguistics and human brain: A perspective of computational neuroscience.arXiv preprint arXiv:2602.08275, 2026

Fudong Zhang, Bo Chai, Yujie Wu, Wai Ting Siok, and Nizhuan Wang. Linguistics and human brain: A perspective of computational neuroscience.arXiv preprint arXiv:2602.08275, 2026

Pith/arXiv arXiv 2026

[3] [3]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine ...

2021

[4] [4]

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 17612–17625. Curran Associates, Inc., 2022

2022

[5] [5]

Mitigate the gap: Improving cross-modal alignment in clip

Sedigheh Eslami and Gerard de Melo. Mitigate the gap: Improving cross-modal alignment in clip. InThe Thirteenth International Conference on Learning Representations, 2025. 19 SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

2025

[6] Yi Xiao, Xuyi Qiao, Yu-Xuan Zhang, and Xianchuan Yu. Causality-inspired brain-visual contrastive learning for zero-shot visual decoding. Knowledge-Based Systems, 346:116182, 2026

[7] Alessandro T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. A large and rich EEG dataset for modeling human visual object recognition. NeuroImage, 264:119754, 2022

[8] Yueyang Li, Zijian Kang, Shengyu Gong, Wenhao Dong, Weiming Zeng, Hongjie Yan, Wai Ting Siok, and Nizhuan Wang. Neural-MCRL: Neural multimodal contrastive representation learning for EEG-based visual decoding. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2025

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

[10] Changde Du, Kaicheng Fu, Jinpeng Li, and Huiguang He. Decoding visual neural representations by multimodal learning of brain-visual-linguistic features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10760–10777, 2023

[11] Yonghao Song, Bingchuan Liu, Xiang Li, Nanlin Shi, Yijun Wang, and Xiaorong Gao. Decoding natural images from EEG for object recognition. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations, volume 2024, pages 47648–47665, 2024

[12] Zhanqiang Guo, Jiamin Wu, Yonghao Song, Jiahui Bu, Weijian Mai, Qihao Zheng, Wanli Ouyang, and Chunfeng Song. Neuro-3D: Towards 3D visual decoding from EEG signals. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23870–23880, 2025

[13] Xin Xiang, Wenhui Zhou, Haonan Zhu, Yunrui Li, Guojun Dai, and Lili Lin. EEG-driven natural image reconstruction with regional semantic awareness. Pattern Recognition, 172:112589, 2026

[14] Emanuele Balloni, Emanuele Frontoni, Chiara Matti, Marina Paolanti, Roberto Pierdicca, and Emiliano Santarnecchi. EEG2Vision: A multimodal EEG-based framework for 2D visual reconstruction in cognitive neuroscience. arXiv preprint arXiv:2604.08063, 2026

[15] Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, Haoyang Qin, and Quanying Liu. Visual decoding and reconstruction via EEG embeddings with guided diffusion. arXiv preprint arXiv:2403.07721, 2024

[16] Wenjiang Zhang, Sifeng Wang, Yuwei Su, Xinyu Li, Chen Zhang, and Suyu Zhong. Neurobridge: Bio-inspired self-supervised EEG-to-image decoding via cognitive priors and bidirectional semantic alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18028–18036, 2026

[17] Haodong Jing, Yongqiang Ma, Panqi Yang, Haoyu Li, Shuai Huang, Badong Chen, and Nanning Zheng. Damind: Zero-shot visual cross-domain alignment and representation for EEG decoding. IEEE Transactions on Image Processing, 35:3214–3227, 2026

[18] Chengjian Xu, Yonghao Song, Qiong Wang, and Qingqing Zheng. Mindsae: Advancing semantic perception for M/EEG-based visual decoding via unified multimodal alignment framework. Biomedical Signal Processing and Control, 123:110390, 2026

[19] Shuai Huang, Huan Luo, Haodong Jing, Qixian Zhang, Litao Chang, Yating Feng, Xiao Lin, Chendong Qin, Han Chen, Shuwen Jia, Siyi Sun, and Yongxiong Wang. NEED: Cross-subject and cross-task generalization for video and image reconstruction from EEG signals. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances ... 2025

[20] Wei Li, Penglu Zhao, Cheng Xu, Yingting Hou, Wenhao Jiang, and Aiguo Song. Deep learning for EEG-based visual classification and reconstruction: Panorama, trends, challenges and opportunities. IEEE Transactions on Biomedical Engineering, 72(11):3374–3390, 2025

[21] Daowen Xiong, Liangliang Hu, Jiahao Jin, Yikang Ding, Congming Tan, Jing Zhang, and Yin Tian. Interpretable cross-modal alignment network for EEG visual decoding with algorithm unrolling. IEEE Transactions on Neural Networks and Learning Systems, 36(11):19894–19908, 2025

[22] Jiahe Meng, Weiming Zeng, Yueyang Li, Bo Chai, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, and Nizhuan Wang. Stambridge: Spectral-temporal amplitude-aware mid-feature bridge for EEG visual decoding. arXiv preprint arXiv:2605.23137, 2026

[23] Jun-Mo Kim, Woohyeok Choi, Sang-Jun Park, Keun-Soo Heo, Young-Han Son, Ji-Hye Oh, Dong-Hee Shin, and Tae-Eui Kam. Seeeeg: Semantic-aware EEG-based multi-modal retrieval-augmented generation for high-fidelity visual brain decoding. In 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 4883–4892, 2025

[24] Yonghao Song, Qingqing Zheng, Bingchuan Liu, and Xiaorong Gao. EEG Conformer: Convolutional transformer for EEG decoding and visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710–719, 2023

[25] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017

[26] Abbas Salami, Javier Andreu-Perez, and Helge Gillmeister. EEG-ITNet: An explainable inception temporal convolutional network for motor imagery classification. IEEE Access, 10:36672–36685, 2022

[27] Xiong Xiong, Li Su, Jinjie Guo, Tianyuan Song, Ying Wang, Jinguo Huang, and Guixia Kang. Enhancing motor imagery decoding in brain–computer interfaces using Riemann tangent space mapping and cross frequency coupling. Biomedical Signal Processing and Control, 99:106797, 2025

[28] Yan Pei, Jiahui Xu, Qianhao Chen, Chenhao Wang, Feng Yu, Lisan Zhang, and Wei Luo. DTP-Net: Learning to reconstruct EEG signals in time-frequency domain by multi-scale feature reuse. IEEE Journal of Biomedical and Health Informatics, 28(5):2662–2673, 2024

[29] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2020

[30] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013

[31] Wenxuan Ma, Hongxin Zhang, Yexuan Li, and Mingyi Wei. Neurodecoder: A new framework for image decoding and reconstruction of EEG signals. IEEE Journal of Biomedical and Health Informatics, pages 1–14, 2026

[32] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

[33] Shengyu Gong, Yueyang Li, Zijian Kang, Bo Chai, Weiming Zeng, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, and Nizhuan Wang. LEL: Lipschitz continuity constrained ensemble learning for efficient EEG-based intrasubject emotion recognition. IEEE Sensors Journal, 26(9):13446–13456, 2026

[34] Yayun Wei, Lei Cao, Hao Li, and Yilin Dong. MB2C: Multimodal bidirectional cycle consistency for learning robust visual neural representations. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 8992–9000, New York, NY, USA, 2024. Association for Computing Machinery

[35] Hongzhou Chen, Lianghua He, Yihang Liu, Longzhen Yang, Shaohua Shang, and MengChu Zhou. Visual neural decoding via improved visual-EEG semantic consistency. arXiv preprint arXiv:2408.06788, 2024

[36] Haitao Wu, Qing Li, Changqing Zhang, Zhen He, and Xiaomin Ying. Bridging the vision-brain gap with an uncertainty-aware blur prior. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2246–2257, 2025

[37] Yueming Sun and Long Yang. Spatial-functional awareness transformer-based graph archetype contrastive learning for decoding visual neural representations from EEG. arXiv preprint arXiv:2509.24761, 2025