pith. machine review for the scientific record.

arxiv: 2604.26218 · v1 · submitted 2026-04-29 · 💻 cs.CV

Recognition: unknown

ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection

Boyu Wang, Ganxi Xu, Guoxu Zhou, Jian Zhu, Jinyi Long, Shuyan Zhou, Yonghao Song, Yuting Tang, Zhao-Rong Lai

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 12:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords brain encoding · MEG · EEG · visual stimuli · variational autoencoder · cross-modal alignment · distribution alignment · Q-Former

The pith

ViBE generates M/EEG signals from visual stimuli by reconstructing neural responses in a spatio-temporal latent space and aligning visual embeddings to it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViBE as a framework that turns visual images into corresponding magnetoencephalography and electroencephalography recordings. It separates the task into faithful signal reconstruction and cross-modal alignment between vision and brain activity. A convolutional variational autoencoder first learns to reconstruct the spatio-temporal patterns of real M/EEG data. A Q-Former then translates CLIP visual embeddings into that same latent space, after which point-wise and distributional losses pull the translated embeddings toward the true neural ones. If the approach holds, it supplies a concrete route from pixels to measurable brain signals on existing datasets.
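
As a rough illustration of that inference path, the sketch below encodes a test image with CLIP, translates the embedding into the TSC-VAE latent space with a Q-Former-style mapper, and decodes a waveform. The component names (clip_model, q_former, vae_decoder) are placeholders for the paper's modules, not its actual API.

```python
# Minimal sketch of the image-to-M/EEG inference path, under assumed interfaces.
import torch

@torch.no_grad()
def generate_meeg(image, clip_model, q_former, vae_decoder):
    img_emb = clip_model(image)   # (B, D_clip) CLIP image embedding
    proxy = q_former(img_emb)     # (B, D_latent) neural proxy embedding in the VAE latent space
    signal = vae_decoder(proxy)   # (B, channels, time) generated M/EEG waveform
    return signal
```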

Core claim

ViBE generates magnetoencephalography and electroencephalography signals from visual stimuli by first using a spatio-temporal convolutional variational autoencoder to reconstruct neural responses, then employing a Q-Former to map CLIP image embeddings into the autoencoder latent space as neural proxy embeddings, and finally applying both mean squared error and sliced Wasserstein distance to align the proxy embeddings with the true latent embeddings.
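
The alignment objective pairs a point-wise term with a distributional one. A minimal PyTorch sketch of such a combined loss, assuming paired batches of proxy and latent embeddings of shape (batch, dim), is below; the function names and the weighting factor lambda_swd are illustrative, and the paper's exact formulation may differ.

```python
# Sketch of MSE (point-wise matching) plus sliced Wasserstein distance
# (distribution alignment) between proxy and latent embeddings.
import torch

def swd(x, y, n_projections=128):
    """Sliced Wasserstein distance between two batches of embeddings (B, D)."""
    d = x.shape[1]
    # Random projection directions on the unit sphere.
    theta = torch.randn(d, n_projections, device=x.device)
    theta = theta / theta.norm(dim=0, keepdim=True)
    # Project both batches onto each direction and sort along the batch axis.
    x_proj = torch.sort(x @ theta, dim=0).values
    y_proj = torch.sort(y @ theta, dim=0).values
    # Squared 1-D Wasserstein-2 distance per slice, averaged over slices.
    return ((x_proj - y_proj) ** 2).mean()

def alignment_loss(proxy, latent, lambda_swd=1.0):
    """Point-wise MSE plus distribution-level SWD."""
    return torch.nn.functional.mse_loss(proxy, latent) + lambda_swd * swd(proxy, latent)
```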

What carries the argument

Spatio-temporal convolutional variational autoencoder (TSC-VAE) whose latent space receives Q-Former-mapped CLIP embeddings and is aligned to real neural latents via combined MSE and sliced Wasserstein losses.
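
To make the carrier concrete, here is a minimal sketch of what a spatio-temporal convolutional VAE encoder for M/EEG of shape (batch, channels, time) could look like: a temporal convolution applied per sensor, a spatial convolution across sensors, and a Gaussian latent with the reparameterization trick. Layer sizes and names are assumptions; the paper's TSC-VAE may be configured differently.

```python
# Illustrative spatio-temporal convolutional VAE encoder for M/EEG windows.
import torch
import torch.nn as nn

class TSCVAEEncoder(nn.Module):
    def __init__(self, n_channels=63, n_times=250, latent_dim=256):
        super().__init__()
        self.temporal = nn.Conv2d(1, 16, kernel_size=(1, 25), padding=(0, 12))  # along time, per sensor
        self.spatial = nn.Conv2d(16, 32, kernel_size=(n_channels, 1))           # across sensors
        self.pool = nn.AvgPool2d((1, 5))
        flat = 32 * (n_times // 5)
        self.mu = nn.Linear(flat, latent_dim)
        self.logvar = nn.Linear(flat, latent_dim)

    def forward(self, x):                    # x: (B, n_channels, n_times)
        h = x.unsqueeze(1)                   # (B, 1, C, T)
        h = torch.relu(self.temporal(h))     # (B, 16, C, T)
        h = torch.relu(self.spatial(h))      # (B, 32, 1, T)
        h = self.pool(h).flatten(1)          # (B, 32 * T // 5)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar
```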

If this is right

  • Higher-fidelity reconstruction of spatio-temporal M/EEG patterns on the THINGS-EEG2 and THINGS-MEG datasets.
  • Tighter cross-modal alignment between visual feature spaces and neural response spaces.
  • A modular pipeline that separates reconstruction from alignment for future brain encoding models.
  • Direct applicability to visual stimulus-to-brain-signal generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment strategy might transfer to other sensory modalities where one needs to map external stimuli into recorded neural activity.
  • Successful generation could supply synthetic training data for downstream brain-computer interface decoders.
  • The framework could be tested on real-time streaming visual input to check whether alignment remains stable outside static image sets.

Load-bearing premise

The assumption that visual features from CLIP, once mapped by Q-Former and aligned with MSE plus sliced Wasserstein distance inside the TSC-VAE latent space, produce M/EEG signals that match real neural responses outside the training distribution.

What would settle it

Comparing generated signals for held-out visual stimuli from the THINGS datasets against the recorded M/EEG responses to the same stimuli, channel-by-channel and time point by time point: low correlation or mismatched statistical structure would undercut the claim.
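
A hedged sketch of that comparison, assuming arrays of shape (n_stimuli, n_channels, n_times) for generated and recorded responses to the same held-out stimuli; the function name and protocol details are illustrative, not the paper's evaluation code.

```python
# Per-sensor Pearson correlation between generated and recorded M/EEG.
import numpy as np

def channelwise_correlation(generated, recorded):
    """generated, recorded: arrays of shape (n_stimuli, n_channels, n_times)."""
    n_stim, n_chan, n_times = generated.shape
    corr = np.zeros(n_chan)
    for c in range(n_chan):
        g = generated[:, c, :].ravel()
        r = recorded[:, c, :].ravel()
        corr[c] = np.corrcoef(g, r)[0, 1]
    return corr  # one correlation per sensor; uniformly low values would undercut the claim
```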

Figures

Figures reproduced from arXiv: 2604.26218 by Boyu Wang, Ganxi Xu, Guoxu Zhou, Jian Zhu, Jinyi Long, Shuyan Zhou, Yonghao Song, Yuting Tang, Zhao-Rong Lai.

Figure 1. Overview of the training procedure. In Stage I, the TSC-VAE employs temporal convo… [caption truncated at source]
Figure 2. Illustration of the inference pipeline. Given a test image… [caption truncated at source]
Figure 3. Visual comparison of TSConv and TSConvPlus across Stage I and Stage II on THINGS… [caption truncated at source]
Figure 4. Distribution comparison of CLIP image embeddings, TSC-VAE latent embeddings, and… [caption truncated at source]
Figure 5. Brain region ablation study results. Left: THINGS-EEG2. Right: THINGS-MEG.
original abstract

Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes ViBE, a brain encoding framework that reconstructs M/EEG signals from visual stimuli using a spatio-temporal convolutional variational autoencoder (TSC-VAE) to capture neural response characteristics, a Q-Former to map CLIP image embeddings into the TSC-VAE latent space as neural proxy embeddings, and a joint training objective combining MSE for point-wise matching with sliced Wasserstein distance (SWD) for distribution alignment. Extensive experiments are claimed on the THINGS-EEG2 and THINGS-MEG datasets to demonstrate high-quality M/EEG signal generation from visual inputs.

Significance. If the quantitative results, baselines, and ablations support the claims, the work could contribute to brain encoding research by integrating spatio-temporal VAE reconstruction with cross-modal distribution alignment, potentially advancing applications toward visual prostheses. The architecture is coherent and the use of standard losses on held-out data avoids obvious circularity, but the lack of reported metrics in the abstract leaves the practical impact unverified from the given description.

major comments (1)
  1. Abstract: the central claim of 'demonstrating the effectiveness of our approach in generating high-quality M/EEG signals' is not supported by any quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the effectiveness cannot be assessed and the experiments section must supply them to substantiate the contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the point below and will revise the manuscript to strengthen the presentation of our results.

point-by-point responses
  1. Referee: Abstract: the central claim of 'demonstrating the effectiveness of our approach in generating high-quality M/EEG signals' is not supported by any quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the effectiveness cannot be assessed and the experiments section must supply them to substantiate the contribution.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the effectiveness claim. The Experiments section of the manuscript already reports quantitative metrics (reconstruction quality on held-out data), baseline comparisons, ablation studies on the TSC-VAE, Q-Former, and loss components, and error analyses across the THINGS-EEG2 and THINGS-MEG datasets. To make the abstract self-contained and directly substantiate the central claim, we will revise it to include key numerical results from those experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a standard encoder-decoder architecture: TSC-VAE reconstructs M/EEG signals from their own spatio-temporal structure, Q-Former maps external CLIP visual embeddings into the VAE latent space, and training uses ordinary MSE plus sliced Wasserstein losses on held-out splits of the public THINGS-EEG2 and THINGS-MEG datasets. None of the reported performance numbers (reconstruction fidelity, alignment metrics) are algebraically forced by the fitted parameters themselves or by any self-referential normalization; the derivation chain consists of empirical training and evaluation against independent ground-truth neural recordings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unproven effectiveness of the proposed TSC-VAE architecture and the assumption that CLIP features plus Q-Former can be aligned to neural latent space via standard losses; no numerical free parameters are stated in the abstract.

axioms (1)
  • domain assumption: CLIP image embeddings contain visual features that are linearly or non-linearly mappable to M/EEG representations.
    Invoked when Q-Former is used to produce neural proxy embeddings from CLIP features (see the sketch after this ledger).
invented entities (1)
  • TSC-VAE: no independent evidence
    purpose: Captures spatio-temporal characteristics of M/EEG signals for reconstruction
    New architectural design introduced in the paper; no independent evidence of its superiority is provided in the abstract.
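
For readers unfamiliar with the mapping the axiom relies on, a Q-Former-style mapper can be sketched as a set of learned queries that cross-attend to CLIP image tokens and are projected into the TSC-VAE latent dimension. All dimensions and names below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative Q-Former-style mapper: learned queries cross-attend to CLIP tokens.
import torch
import torch.nn as nn

class QFormerMapper(nn.Module):
    def __init__(self, clip_dim=768, latent_dim=256, n_queries=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, clip_dim))
        self.attn = nn.MultiheadAttention(clip_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(n_queries * clip_dim, latent_dim)

    def forward(self, clip_tokens):            # clip_tokens: (B, n_tokens, clip_dim)
        q = self.queries.unsqueeze(0).expand(clip_tokens.size(0), -1, -1)
        out, _ = self.attn(q, clip_tokens, clip_tokens)  # queries attend to image tokens
        return self.proj(out.flatten(1))       # (B, latent_dim) neural proxy embedding
```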

pith-pipeline@v0.9.0 · 5556 in / 1363 out tokens · 100806 ms · 2026-05-07T12:16:59.682348+00:00 · methodology



    Matthias Guggenmos, Philipp Sterzer, and Radoslaw Martin Cichy. Multivariate pattern analysis for meg: A comparison of dissimilarity measures.Neuroimage, 173:434–447, 2018. 12 A Technical appendices and supplementary material A.1 Experiment Details We implement ViBE using PyTorch on four NVIDIA V100S GPUs. Datasets and Preprocessing.We evaluate on THINGS-...