ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection
Pith reviewed 2026-05-07 12:16 UTC · model grok-4.3
The pith
ViBE generates M/EEG signals from visual stimuli by reconstructing neural responses in a spatio-temporal latent space and aligning visual embeddings to it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViBE generates magnetoencephalography and electroencephalography signals from visual stimuli by first using a spatio-temporal convolutional variational autoencoder to reconstruct neural responses, then employing a Q-Former to map CLIP image embeddings into the autoencoder latent space as neural proxy embeddings, and finally applying both mean squared error and sliced Wasserstein distance to align the proxy embeddings with the true latent embeddings.
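To make the three-stage pipeline concrete, here is a minimal sketch of the generation path in PyTorch. The function and module names, interfaces, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch

# Hypothetical shapes: B images, C M/EEG channels, T time points.
# clip_encoder, q_former, and tsc_vae stand in for the paper's modules;
# their interfaces here are assumptions for illustration only.
def vibe_generate(images, clip_encoder, q_former, tsc_vae):
    """Sketch of ViBE's generation path: image -> CLIP -> Q-Former -> latent -> M/EEG."""
    with torch.no_grad():
        clip_emb = clip_encoder(images)        # (B, N_patches, D_clip), frozen CLIP
    proxy_latent = q_former(clip_emb)          # (B, D_latent) neural proxy embedding
    generated = tsc_vae.decode(proxy_latent)   # (B, C, T) synthesized M/EEG signal
    return proxy_latent, generated
```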
What carries the argument
Spatio-temporal convolutional variational autoencoder (TSC-VAE) whose latent space receives Q-Former-mapped CLIP embeddings and is aligned to real neural latents via combined MSE and sliced Wasserstein losses.
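The combined alignment objective is standard enough to sketch. Below is a common Monte-Carlo estimator of the (squared) sliced Wasserstein-2 distance paired with MSE; the weight `lam`, the number of projections, and the equal-batch-size convention are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def sliced_wasserstein(x, y, n_proj=64):
    """Monte-Carlo estimator of the squared sliced Wasserstein-2 distance
    between two batches of embeddings x, y of shape (B, D). Assumes equal
    batch sizes; a standard estimator, not the authors' exact code."""
    d = x.size(1)
    theta = torch.randn(d, n_proj, device=x.device)
    theta = theta / theta.norm(dim=0, keepdim=True)  # random unit directions
    px = (x @ theta).sort(dim=0).values              # sorted 1-D projections
    py = (y @ theta).sort(dim=0).values
    return ((px - py) ** 2).mean()

def alignment_loss(proxy, latent, lam=1.0):
    """Point-wise matching (MSE) plus distribution alignment (SWD),
    as the abstract describes; lam is a hypothetical trade-off weight."""
    return F.mse_loss(proxy, latent) + lam * sliced_wasserstein(proxy, latent)
```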
If this is right
- Higher-fidelity reconstruction of spatio-temporal M/EEG patterns on the THINGS-EEG2 and THINGS-MEG datasets.
- Tighter cross-modal alignment between visual feature spaces and neural response spaces.
- A modular pipeline that separates reconstruction from alignment for future brain encoding models.
- Direct applicability to visual stimulus-to-brain-signal generation tasks.
Where Pith is reading between the lines
- The same alignment strategy might transfer to other sensory modalities where one needs to map external stimuli into recorded neural activity.
- Successful generation could supply synthetic training data for downstream brain-computer interface decoders.
- The framework could be tested on real-time streaming visual input to check whether alignment remains stable outside static image sets.
Load-bearing premise
The assumption that visual features from CLIP, once mapped by Q-Former and aligned with MSE plus sliced Wasserstein distance inside the TSC-VAE latent space, produce M/EEG signals that match real neural responses outside the training distribution.
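Since the premise leans on the Q-Former mapping, a minimal sketch of that component may help: learned query tokens cross-attend to CLIP patch embeddings and are pooled into a single proxy latent. The dimensions, single attention layer, and mean pooling are assumptions; a full Q-Former stacks several such blocks.

```python
import torch
import torch.nn as nn

class QueryMapper(nn.Module):
    """Q-Former-style mapper: learned queries cross-attend to frozen CLIP
    patch embeddings and are pooled into a proxy latent. A sketch under
    assumed dimensions, not the paper's architecture."""
    def __init__(self, d_clip=768, d_latent=256, n_queries=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_latent))
        self.proj_kv = nn.Linear(d_clip, d_latent)
        self.attn = nn.MultiheadAttention(d_latent, n_heads, batch_first=True)
        self.out = nn.Linear(d_latent, d_latent)

    def forward(self, clip_patches):              # (B, N, d_clip)
        kv = self.proj_kv(clip_patches)           # (B, N, d_latent)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        attended, _ = self.attn(q, kv, kv)        # queries attend to CLIP patches
        return self.out(attended.mean(dim=1))     # (B, d_latent) proxy embedding
```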
What would settle it
The claim would fail if generated signals for held-out visual stimuli from the THINGS datasets showed low correlation, or mismatched statistical structure, with the M/EEG actually recorded for those stimuli when compared channel-by-channel and time-point-by-time-point (one such check is sketched below).
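As a sketch, one way to run that check is a per-channel Pearson correlation over time between generated and recorded trials. The array shapes and the trial-averaging are assumptions about how the evaluation would be set up, not a protocol stated in the abstract.

```python
import numpy as np

def channelwise_correlation(generated, recorded):
    """Pearson correlation over time, per channel, between generated and
    recorded M/EEG, both shaped (trials, channels, time). Returns the mean
    correlation per channel across trials; thresholds are up to the evaluator."""
    g = generated - generated.mean(axis=-1, keepdims=True)
    r = recorded - recorded.mean(axis=-1, keepdims=True)
    num = (g * r).sum(axis=-1)
    den = np.sqrt((g ** 2).sum(axis=-1) * (r ** 2).sum(axis=-1)) + 1e-12
    return (num / den).mean(axis=0)   # (channels,)
```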
Original abstract
Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ViBE, a brain encoding framework that reconstructs M/EEG signals from visual stimuli using a spatio-temporal convolutional variational autoencoder (TSC-VAE) to capture neural response characteristics, a Q-Former to map CLIP image embeddings into the TSC-VAE latent space as neural proxy embeddings, and a joint training objective combining MSE for point-wise matching with sliced Wasserstein distance (SWD) for distribution alignment. Extensive experiments are claimed on the THINGS-EEG2 and THINGS-MEG datasets to demonstrate high-quality M/EEG signal generation from visual inputs.
Significance. If the quantitative results, baselines, and ablations support the claims, the work could contribute to brain encoding research by integrating spatio-temporal VAE reconstruction with cross-modal distribution alignment, potentially advancing applications toward visual prostheses. The architecture is coherent and the use of standard losses on held-out data avoids obvious circularity, but the lack of reported metrics in the abstract leaves the practical impact unverified from the given description.
major comments (1)
- Abstract: the central claim of 'demonstrating the effectiveness of our approach in generating high-quality M/EEG signals' is not supported by any quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the effectiveness cannot be assessed and the experiments section must supply them to substantiate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract. We address the point below and will revise the manuscript to strengthen the presentation of our results.
Point-by-point responses
Referee: Abstract: the central claim of 'demonstrating the effectiveness of our approach in generating high-quality M/EEG signals' is not supported by any quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the effectiveness cannot be assessed and the experiments section must supply them to substantiate the contribution.
Authors: We agree that the abstract would benefit from explicit quantitative support for the effectiveness claim. The Experiments section of the manuscript already reports quantitative metrics (reconstruction quality on held-out data), baseline comparisons, ablation studies on the TSC-VAE, Q-Former, and loss components, and error analyses across the THINGS-EEG2 and THINGS-MEG datasets. To make the abstract self-contained and directly substantiate the central claim, we will revise it to include key numerical results from those experiments.
Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper presents a standard encoder-decoder architecture: TSC-VAE reconstructs M/EEG signals from their own spatio-temporal structure, Q-Former maps external CLIP visual embeddings into the VAE latent space, and training uses ordinary MSE plus sliced Wasserstein losses on held-out splits of the public THINGS-EEG2 and THINGS-MEG datasets. None of the reported performance numbers (reconstruction fidelity, alignment metrics) are algebraically forced by the fitted parameters themselves or by any self-referential normalization; the derivation chain consists of empirical training and evaluation against independent ground-truth neural recordings.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CLIP image embeddings contain visual features that are linearly or non-linearly mappable to M/EEG representations.
invented entities (1)
- TSC-VAE (no independent evidence)
Reference graph
Works this paper leans on
- [1] Thomas Naselaris, Kendrick N Kay, Shinji Nishimoto, and Jack L Gallant. Encoding and decoding in fMRI. NeuroImage, 56(2):400–410, 2011.
- [2] Guangyin Bao, Qi Zhang, Zixuan Gong, Zhuojia Wu, and Duoqian Miao. MindSimulator: Exploring brain concept localization via synthetic fMRI. arXiv preprint arXiv:2503.02351, 2025.
- [3] Andrew Luo, Maggie Henderson, Leila Wehbe, and Michael Tarr. Brain diffusion for visual exploration: Cortical discovery using large scale generative models. Advances in Neural Information Processing Systems, 36:75740–75781, 2023.
- [4] Andrew F Luo, Jacob Yeung, Rushikesh Zawar, Shaurya Dewan, Margaret M Henderson, Leila Wehbe, and Michael J Tarr. Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transformers. arXiv preprint arXiv:2410.05266, 2024.
- [5] Eduardo Fernandez. Development of visual neuroprostheses: trends and challenges. Bioelectronic Medicine, 4(1):12, 2018.
- [6] Ahmed Soltan, John Martin Barrett, Pleun Maaskant, Niall Armstrong, Walid Al-Atabany, Lionel Chaudet, Mark Neil, Evelyne Sernagor, and Patrick Degenaar. A head mounted device stimulator for optogenetic retinal prosthesis. Journal of Neural Engineering, 15(6):065002, 2018.
- [7] Jacob Granley, Lucas Relic, and Michael Beyeler. Hybrid neural autoencoders for stimulus encoding in visual and other sensory neuroprostheses. Advances in Neural Information Processing Systems, 35:22671–22685, 2022.
- [8] Weijian Mai, Jiamin Wu, Yu Zhu, Zhouheng Yao, Dongzhan Zhou, Andrew F Luo, Qihao Zheng, Wanli Ouyang, and Chunfeng Song. SynBrain: Enhancing visual-to-fMRI synthesis via probabilistic representation learning. arXiv preprint arXiv:2508.10298, 2025.
- [9] Shaobo Wang, Yicun Yang, Zhiyuan Liu, Chenghao Sun, Xuming Hu, Conghui He, and Linfeng Zhang. Dataset distillation with neural characteristic function: A minmax perspective. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25570–25580, 2025.
- [10] Ganxi Xu, Zhao-Rong Lai, Yuting Tang, Yonghao Song, Guoxu Zhou, Jian Zhu, Jinyi Long, et al. Image-to-brain signal generation for visual prosthesis with CLIP guided multimodal diffusion models. arXiv preprint arXiv:2509.00787, 2025.
- [11] Leila Montazeri, Nizar El Zarif, Stuart Trenholm, and Mohamad Sawan. Optogenetic stimulation for restoring vision to patients suffering from retinal degenerative diseases: current strategies and future directions. IEEE Transactions on Biomedical Circuits and Systems, 13(6):1792–1807, 2019.
- [12] Alessandro T Gifford, Kshitij Dwivedi, Gemma Roig, and Radoslaw M Cichy. A large and rich EEG dataset for modeling human visual object recognition. NeuroImage, 264:119754, 2022.
- [13] Martin N Hebart, Oliver Contier, Lina Teichmann, Adam H Rockter, Charles Y Zheng, Alexis Kidder, Anna Corriveau, Maryam Vaziri-Pashkam, and Chris I Baker. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife, 12:e82580, 2023.
- [14] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [16] Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, Haoyang Qin, and Quanying Liu. Visual decoding and reconstruction via EEG embeddings with guided diffusion. arXiv preprint arXiv:2403.07721, 2024.
- [17] Qiming Zhang, Jing Zhang, Yufei Xu, and Dacheng Tao. Vision transformer with quadrangle attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3608–3624, 2024.
- [18] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
- [19] Zijin Gu, Keith Jamison, Mert Sabuncu, and Amy Kuceyeski. Personalized visual encoding model construction with small data. Communications Biology, 5(1):1382, 2022.
- [20] Alexander G Huth, Wendy A De Heer, Thomas L Griffiths, Frédéric E Theunissen, and Jack L Gallant. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458, 2016.
- [21] Tom M Mitchell, Svetlana V Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L Malave, Robert A Mason, and Marcel Adam Just. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195, 2008.
- [22] Jerry Tang, Meng Du, Vy Vo, Vasudev Lal, and Alexander Huth. Brain encoding models based on multimodal transformers can transfer across language and vision. Advances in Neural Information Processing Systems, 36:29654–29666, 2023.
- [23] Hossein Adeli, Sun Minni, and Nikolaus Kriegeskorte. Predicting brain activity using transformers. bioRxiv, 2023.
- [24] Roman Beliy, Navve Wasserman, Amit Zalcher, and Michal Irani. The wisdom of a crowd of brains: A universal brain encoder. arXiv preprint arXiv:2406.12179, 2024.
- [25] Alessandro T Gifford, Benjamin Lahner, Sari Saba-Sadiya, Martina G Vilas, Alex Lascelles, Aude Oliva, Kendrick Kay, Gemma Roig, and Radoslaw M Cichy. The Algonauts Project 2023 challenge: How the human brain makes sense of natural scenes. arXiv preprint arXiv:2301.03198, 2023.
- [26] Volker Busskamp, Jens Duebel, David Balya, Mathias Fradot, Tim James Viney, Sandra Siegert, Anna C Groner, Erik Cabuy, Valérie Forster, Mathias Seeliger, et al. Genetic reactivation of cone photoreceptors restores visual responses in retinitis pigmentosa. Science, 329(5990):413–417, 2010.
- [27] Jasmina Cehajic-Kapetanovic, Mandeep S Singh, Eberhart Zrenner, and Robert E MacLaren. Bioengineering strategies for restoring vision. Nature Biomedical Engineering, 7(4):387–404, 2023.
- [28] Paul-Henri Prévot, Kevin Gehere, Fabrice Arcizet, Himanshu Akolkar, Mina A Khoei, Kévin Blaize, Omar Oubari, Pierre Daye, Marion Lanoë, Manon Valet, et al. Behavioural responses to a photovoltaic subretinal prosthesis implanted in non-human primates. Nature Biomedical Engineering, 4(2):172–180, 2020.
- [29] Michael H Berry, Amy Holt, Joshua Levitz, Johannes Broichhagen, Benjamin M Gaub, Meike Visel, Cherise Stanley, Krishan Aghi, Yang Joon Kim, Kevin Cao, et al. Restoration of patterned vision with an engineered photoactivatable G protein-coupled receptor. Nature Communications, 8(1):1862, 2017.
- [30] José-Alain Sahel, Elise Boulanger-Scemama, Chloé Pagot, Angelo Arleo, Francesco Galluppi, Joseph N Martel, Simona Degli Esposti, Alexandre Delaux, Jean-Baptiste de Saint Aubert, Caroline de Montleau, et al. Partial recovery of visual function in a blind patient after optogenetic therapy. Nature Medicine, 27(7):1223–1229, 2021.
- [31] Maxwell H Turner, Luis Gonzalo Sanchez Giraldo, Odelia Schwartz, and Fred Rieke. Stimulus- and goal-oriented frameworks for understanding natural vision. Nature Neuroscience, 22(1):15–24, 2019.
- [32] Olivier Barnich and Marc Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing, 20(6):1709–1724, 2010.
- [33] Justin R Boyle, Anthony J Maeder, and Wageeh W Boles. Region-of-interest processing for electronic visual prostheses. Journal of Electronic Imaging, 17(1):013002, 2008.
- [34] Fei Guo, Yuan Yang, and Yong Gao. Optimization of visual information presentation for visual prosthesis. International Journal of Biomedical Imaging, 2018(1):3198342, 2018.
- [35] Fei Guo, Yuan Yang, Yang Xiao, Yong Gao, and Ningmei Yu. Recognition of moving object in high dynamic scene for visual prosthesis. IEICE Transactions on Information and Systems, 102(7):1321–1331, 2019.
- [36] Burcu Küçükoğlu, Bodo Rueckauer, Nasir Ahmad, Jaap de Ruyter van Steveninck, Umut Güçlü, and Marcel van Gerven. Optimization of neuroprosthetic vision via end-to-end deep reinforcement learning. International Journal of Neural Systems, 32(11):2250052, 2022.
- [37] Jacob Granley and Michael Beyeler. A computational model of phosphene appearance for epiretinal prostheses. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 4477–4481. IEEE, 2021.
- [38] Maureen van der Grinten, Jaap de Ruyter van Steveninck, Antonio Lozano, Laura Pijnacker, Bodo Rueckauer, Pieter Roelfsema, Marcel van Gerven, Richard van Wezel, Umut Güçlü, and Yağmur Güçlütürk. Towards biologically plausible phosphene simulation for the differentiable optimization of visual cortical prostheses. eLife, 13:e85812, 2024.
- [39] Jaap de Ruyter van Steveninck, Umut Güçlü, Richard van Wezel, and Marcel van Gerven. End-to-end optimization of prosthetic vision. Journal of Vision, 22(2):20, 2022.
- [40] Yonghao Song, Bingchuan Liu, Xiang Li, Nanlin Shi, Yijun Wang, and Xiaorong Gao. Decoding natural images from EEG for object recognition. arXiv preprint arXiv:2308.13234, 2023.
- [41] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [43] Umut Güçlü and Marcel AJ van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.
- [44] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
- [45] Matthias Guggenmos, Philipp Sterzer, and Radoslaw Martin Cichy. Multivariate pattern analysis for MEG: A comparison of dissimilarity measures. NeuroImage, 173:434–447, 2018.
discussion (0)