Brain-IT-VQA: From Brain Signals to Answers

Matias Cosarinsky; Michal Irani; Navve Wasserman; Oliver Heinimann; Roman Beliy

arxiv: 2605.29588 · v2 · pith:2IP6X7FXnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI· q-bio.NC

Brain-IT-VQA: From Brain Signals to Answers

Roman Beliy , Matias Cosarinsky , Oliver Heinimann , Navve Wasserman , Michal Irani This is my paper

Pith reviewed 2026-06-29 08:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AIq-bio.NC

keywords fMRI decodingvisual question answeringbrain signal to textNSD-VQA benchmarklanguage token decodingvisual information in brain responses

0 comments

The pith

Brain-IT-VQA converts fMRI signals into language tokens that a language model uses to answer questions about viewed images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that reads fMRI brain scans taken while people view natural images and produces answers to questions about those images. It works by first turning brain activity into sequences of language tokens, then passing those tokens to a language model for final answers. This yields better performance than earlier fMRI captioning and VQA systems. The authors also release NSD-VQA, a benchmark that supplies roughly twenty controlled question-answer pairs for each image across twenty categories designed to separate different layers of visual understanding. With this benchmark they measure which kinds of visual and semantic content can be extracted reliably from fMRI responses and how different brain regions contribute to each kind of question.

Core claim

Brain-IT-VQA decodes language tokens from fMRI responses to natural images and integrates them with a language model to answer visual questions, substantially outperforming prior fMRI-based captioning and VQA methods. The NSD-VQA benchmark supplies on average twenty question-answer pairs per image across twenty controlled categories that separate levels of visual understanding, allowing quantification of which visual and semantic information is reliably decodable and analysis of brain-region contributions across question types.

What carries the argument

Brain Interaction Transformer that decodes language tokens directly from brain activity for use by a downstream language model.

If this is right

The method produces higher accuracy than earlier fMRI captioning and VQA systems on the new benchmark.
Quantification shows which visual and semantic features are reliably present in fMRI responses to natural images.
Brain-region contributions can be mapped separately for each of the twenty question categories.
The controlled question set enables more interpretable evaluation despite small fMRI test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token-decoding step could be reused with other brain-recording methods such as EEG to test cross-modal consistency.
If token sequences prove stable across subjects, they might serve as a compact intermediate representation for brain-to-text systems.
The benchmark categories could be extended to test whether high-level semantic relations remain decodable when low-level visual features are controlled.
Performance differences across categories may indicate which visual computations occur early enough in the visual hierarchy to be captured by current fMRI resolution.

Load-bearing premise

The twenty controlled question categories in NSD-VQA truly separate distinct levels of visual understanding and the limited fMRI test data still supports stable performance comparisons.

What would settle it

A replication in which the new model shows no accuracy gain over prior fMRI VQA baselines on the NSD-VQA test set, or in which accuracy does not differ systematically across the twenty question categories in the expected pattern.

Figures

Figures reproduced from arXiv: 2605.29588 by Matias Cosarinsky, Michal Irani, Navve Wasserman, Oliver Heinimann, Roman Beliy.

**Figure 1.** Figure 1: fMRI-to-Language decoding: Captioning & VQA directly from fMRI brain activity. (a) Overview of the pipeline; fMRI signals recorded while a subject views an image are used to generate a caption or answer questions about the image. (b) Example captions on question-answer pairs generated by Brain-IT-VQA directly from fMRI signals. models remain limited in performance, and are not necessarily optimized for the… view at source ↗

**Figure 2.** Figure 2: Overview of the Brain-IT-VQA architecture. which types of visual or semantic information are recoverable from neural signals. Recent work has broadened the evaluation scope: BrainHub [13] extends NSD with captioning and grounding tasks, and BrainChat [14] introduces fMRI question answering evaluated by classification accuracy on broad COCO VQA questions. However, neither provides controlled question catego… view at source ↗

**Figure 3.** Figure 3: NSD-VQA Dataset construction pipeline. Starting from NSD images, we generate structured annotations using a VLM, followed by filtering and verification. Template-based question generation then produces multiple question–answer pairs per image across controlled question categories. 3.3 Training Setup Training proceeds in two stages: In the first stage (BIT-L Pretraining), BIT-L is trained to predict two tar… view at source ↗

**Figure 4.** Figure 4: Visualization of voxel-cluster marginal contributions across question categories. Different clusters show varying levels of importance depending on the question type (e.g., object, attribute, relation), highlighting how distinct brain regions support different aspects of visual and semantic processing. Technical Details: We estimate the marginal contribution of each voxel cluster to each question category … view at source ↗

read the original abstract

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New NSD-VQA benchmark with controlled categories is the real addition here, but limited test data makes the disentangled claims hard to trust without more evidence.

read the letter

The paper's main new piece is the NSD-VQA dataset: 20 question-answer pairs per image across 20 categories meant to separate different levels of visual and semantic understanding from fMRI. They also describe Brain-IT-VQA, which decodes language tokens from brain signals via their prior Brain-IT transformer and passes them to a language model for answering.

This setup is a step past earlier fMRI VQA work that used fewer, broader questions. The controlled categories and region analysis across question types give a clearer way to probe what information is actually decodable.

The soft spot is exactly the one in the stress-test note. fMRI test sets are small by nature, and nothing in the abstract or description shows that the 20 categories are independent enough or that per-category variance is low enough for the performance gains and brain-region breakdowns to be reliable. If questions correlate or noise swamps the signal, the "more interpretable evaluation" claim does not hold up.

This is for labs working on brain decoding, visual representations in fMRI, or early BCI ideas. Readers who need a new benchmark to test models against controlled questions could use the dataset even if the current numbers need tightening.

It deserves a serious referee because the dataset itself is a usable contribution that others can build on, provided the evaluation details get stronger scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper introduces Brain-IT-VQA, a framework extending the Brain Interaction Transformer to decode language tokens from fMRI signals and integrate them with a language model for visual question answering. It claims substantial outperformance over prior fMRI-based captioning and VQA methods. The work also presents the NSD-VQA benchmark, which supplies ~20 question-answer pairs per image across 20 controlled categories designed to disentangle visual and semantic levels, enabling more reliable evaluation and quantification of decodable information from natural-image fMRI responses, plus region-wise analyses.

Significance. If the performance gains and benchmark validity hold, the contribution would be a stronger predictive model for brain-signal decoding together with a controlled tool for probing the structure of visual representations in cortex.

major comments (2)

[Abstract] Abstract: the central claim that Brain-IT-VQA 'substantially outperforms previous fMRI-based captioning and VQA approaches' is stated without any quantitative results, baselines, error bars, or statistical tests, rendering the outperformance assertion impossible to evaluate.
[Abstract] Abstract (NSD-VQA description): the assertion that the 20 controlled categories 'disentangle multiple levels of visual understanding' and permit 'more reliable and interpretable evaluation despite limited fMRI test data' is load-bearing for the benchmark's utility, yet no supporting analysis (inter-question correlations, per-category variance, or power calculations) is referenced; this directly engages the concern that small test sets may render performance deltas and region analyses uninterpretable.

minor comments (1)

[Abstract] The abstract states that the framework 'quantify which forms of visual and semantic information can be reliably decoded' and analyzes 'contributions of different brain regions across question types,' but supplies no methodological details on the decoding metrics, statistical tests, or region definitions used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and indicate planned revisions to strengthen the presentation of results and benchmark justification.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that Brain-IT-VQA 'substantially outperforms previous fMRI-based captioning and VQA approaches' is stated without any quantitative results, baselines, error bars, or statistical tests, rendering the outperformance assertion impossible to evaluate.

Authors: We agree that the abstract would be improved by including key quantitative support. The main text reports specific metrics (accuracy, BLEU, and CIDEr scores with error bars and statistical tests against prior methods in Tables 1-2 and Section 4.1). In revision we will add concise numerical results and baseline references to the abstract while keeping it within length limits. revision: yes
Referee: [Abstract] Abstract (NSD-VQA description): the assertion that the 20 controlled categories 'disentangle multiple levels of visual understanding' and permit 'more reliable and interpretable evaluation despite limited fMRI test data' is load-bearing for the benchmark's utility, yet no supporting analysis (inter-question correlations, per-category variance, or power calculations) is referenced; this directly engages the concern that small test sets may render performance deltas and region analyses uninterpretable.

Authors: The main text already provides per-category performance variance, region-wise contribution maps, and interpretability results (Sections 4.2-4.3) that support the benchmark's utility. We will revise the abstract to explicitly reference these analyses. We can also add a short note on inter-question correlations if the referee deems it necessary; power calculations for the test set size can be included in the supplement. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces Brain-IT-VQA by extending the prior Brain-IT model and presents NSD-VQA as a new benchmark whose 20 controlled categories are asserted to disentangle visual understanding levels. Performance gains and region analyses are framed as empirical outcomes from applying the decoder to fMRI data, with no equations, fitted-parameter renamings, or self-citation chains shown that would reduce the central claims to inputs by construction. The derivation remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms, or invented entities; ledger is empty.

pith-pipeline@v0.9.1-grok · 5814 in / 1134 out tokens · 34141 ms · 2026-06-29T08:18:43.018211+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 22 canonical work pages · 9 internal anchors

[1]

Brain-it: Image reconstruction from fmri via brain-interaction transformer

Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, and Michal Irani. Brain-it: Image reconstruction from fmri via brain-interaction transformer. InInternational Conference on Learning Representations (ICLR), 2026

2026
[2]

Variational autoencoder: An unsupervised model for encoding and decoding fmri activity in visual cortex.NeuroImage, 198:125–136, 2019

Kuan Han, Haiguang Wen, Junxing Shi, Kun-Han Lu, Yizhen Zhang, Di Fu, and Zhongming Liu. Variational autoencoder: An unsupervised model for encoding and decoding fmri activity in visual cortex.NeuroImage, 198:125–136, 2019

2019
[3]

Dcnn-gan: Reconstructing realistic image from fmri

Yunfeng Lin, Jiangbei Li, and Hanjing Wang. Dcnn-gan: Reconstructing realistic image from fmri. In2019 16th International Conference on Machine Vision Applications (MVA), pages 1–6. IEEE, 2019

2019
[4]

Reconstructing natural scenes from fmri patterns using bigbigan

Milad Mozafari, Leila Reddy, and Rufin VanRullen. Reconstructing natural scenes from fmri patterns using bigbigan. In2020 International joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2020

2020
[5]

Biggan-based bayesian reconstruction of natural images from human brain activity.Neuroscience, 444:92–105, 2020

Kai Qiao, Jian Chen, Linyuan Wang, Chi Zhang, Li Tong, and Bin Yan. Biggan-based bayesian reconstruction of natural images from human brain activity.Neuroscience, 444:92–105, 2020

2020
[6]

Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning.NeuroImage, 228:117602, 2021

Ziqi Ren, Jie Li, Xuetong Xue, Xin Li, Fan Yang, Zhicheng Jiao, and Xinbo Gao. Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning.NeuroImage, 228:117602, 2021

2021
[7]

Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding

Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22710–22720, 2023

2023
[8]

High-resolution image reconstruction with latent diffusion models from human brain activity

Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

2023
[9]

Natural scene reconstruction from fmri signals using generative latent diffusion.Scientific Reports, 13(1):15666, 2023

Furkan Ozcelik and Rufin VanRullen. Natural scene reconstruction from fmri signals using generative latent diffusion.Scientific Reports, 13(1):15666, 2023

2023
[10]

Deep image recon- struction from human brain activity.PLoS computational biology, 15(1):e1006633, 2019

Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image recon- struction from human brain activity.PLoS computational biology, 15(1):e1006633, 2019

2019
[11]

Constraint-free natural image reconstruction from fmri signals based on convolutional neural network.Frontiers in human neuroscience, 12:242, 2018

Chi Zhang, Kai Qiao, Linyuan Wang, Li Tong, Ying Zeng, and Bin Yan. Constraint-free natural image reconstruction from fmri signals based on convolutional neural network.Frontiers in human neuroscience, 12:242, 2018

2018
[12]

Mindbridge: A cross-subject brain decoding framework

Shizun Wang, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Mindbridge: A cross-subject brain decoding framework. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11333–11342, 2024

2024
[13]

Umbrae: Unified multi- modal brain decoding

Weihao Xia, Raoul de Charette, Cengiz Oztireli, and Jing-Hao Xue. Umbrae: Unified multi- modal brain decoding. InEuropean Conference on Computer Vision, pages 242–259. Springer, 2024

2024
[14]

SSM2Mel: State Space Model to Reconstruct Mel Spectrogram from the EEG,

Wanqiu Huang, Ke Ma, Tingyu Xie, and Hongwei Wang. Brainchat: Interactive semantic information decoding from fmri using large-scale vision-language pretrained models. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10889434. 10

work page doi:10.1109/icassp49660.2025.10889434 2025
[15]

MindLLM: A subject-agnostic and versatile model for fMRI-to-text decoding

Weikang Qiu, Zheng Huang, Haoyu Hu, Aosong Feng, Yujun Yan, and Rex Ying. MindLLM: A subject-agnostic and versatile model for fMRI-to-text decoding. InForty-second Interna- tional Conference on Machine Learning, 2025. URL https://openreview.net/forum? id=EiAQrilPYP

2025
[16]

M. et al. Ferrante. Brain captioning: Decoding human brain activity into images and text.arXiv preprint arXiv:2305.11560, 2023

work page arXiv 2023
[17]

Unibrain: A unified model for cross-subject brain decoding.arXiv preprint arXiv:2412.19487, 2024

Zicheng Wang, Zhen Zhao, Luping Zhou, and Parashkev Nachev. Unibrain: A unified model for cross-subject brain decoding.arXiv preprint arXiv:2412.19487, 2024

work page arXiv 2024
[18]

Mindgpt: Interpreting what you see with non-invasive brain recordings, 2023

Jiaxuan Chen, Yu Qi, Yueming Wang, and Gang Pan. Mindgpt: Interpreting what you see with non-invasive brain recordings, 2023. URLhttps://arxiv.org/abs/2309.15729

work page arXiv 2023
[19]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

2022
[20]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023
[21]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[22]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. Technical report, OpenAI, 2023. arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, et al. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023
[25]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Wayne Xin Zhao, Ji-Rong Wang, Wenhao Wang, Chunyu Kit Chan, and Steven C.H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.arXiv preprint arXiv:2305.06500, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

Kendrick N Kay, Thomas Naselaris, Ryan J Prenger, and Jack L Gallant. Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

2008
[27]

Bayesian reconstruction of natural images from human brain activity.Neuron, 63(6):902–915, 2009

Thomas Naselaris, Ryan J Prenger, Kendrick N Kay, Michael Oliver, and Jack L Gallant. Bayesian reconstruction of natural images from human brain activity.Neuron, 63(6):902–915, 2009

2009
[28]

Reconstructing visual experiences from brain activity evoked by natural movies.Current biology, 21(19):1641–1646, 2011

Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gallant. Reconstructing visual experiences from brain activity evoked by natural movies.Current biology, 21(19):1641–1646, 2011

2011
[29]

Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream.Journal of Neuroscience, 35 (27):10005–10014, 2015

Umut Güçlü and Marcel AJ Van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream.Journal of Neuroscience, 35 (27):10005–10014, 2015

2015
[30]

Generative adversarial networks for reconstructing natural images from brain activity

Katja Seeliger, Umut Güçlü, Luca Ambrogioni, Yagmur Güçlütürk, and Marcel AJ Van Ger- ven. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage, 181:775–785, 2018

2018
[31]

Generative adversarial networks conditioned on brain activity reconstruct seen images

Ghislain St-Yves and Thomas Naselaris. Generative adversarial networks conditioned on brain activity reconstruct seen images. In2018 IEEE international conference on systems, man, and cybernetics (SMC), pages 1054–1061. IEEE, 2018

2018
[32]

From voxels to pixels and back: Self-supervision in natural-image reconstruction from fmri.Advances in Neural Information Processing Systems, 32:6514–6524, 2019

Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini, Tal Golan, and Michal Irani. From voxels to pixels and back: Self-supervision in natural-image reconstruction from fmri.Advances in Neural Information Processing Systems, 32:6514–6524, 2019. 11

2019
[33]

S., Tripathy, M., Villanueva, C

Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data.arXiv preprint arXiv:2403.11207, 2024

work page arXiv 2024
[34]

Mindtuner: Cross-subject visual decoding with visual fingerprint and semantic correction

Zixuan Gong, Qi Zhang, Guangyin Bao, Lei Zhu, Rongtao Xu, Ke Liu, Liang Hu, and Duoqian Miao. Mindtuner: Cross-subject visual decoding with visual fingerprint and semantic correction. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14247– 14255, 2025

2025
[35]

Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2:1–21, 2024

Matteo Ferrante, Tommaso Boccato, Furkan Ozcelik, Rufin VanRullen, and Nicola Toschi. Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2:1–21, 2024

2024
[36]

See through their minds: Learning transferable brain decoding models from cross-subject fmri

Yulong Liu, Yongqiang Ma, Guibo Zhu, Haodong Jing, and Nanning Zheng. See through their minds: Learning transferable brain decoding models from cross-subject fmri. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5730–5738, 2025

2025
[37]

The Wisdom of a Crowd of Brains: A Universal Brain Encoder

Roman Beliy, Navve Wasserman, Amit Zalcher, and Michal Irani. The wisdom of a crowd of brains: A universal brain encoder.arXiv preprint arXiv:2406.12179, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Interpretable fmri captioning via contrastive learning

Vyacheslav Shen, Kassymzhomart Kunanbayev, Donggon Jang, and Daeshik Kim. Interpretable fmri captioning via contrastive learning. In James C. Gee, Daniel C. Alexander, Jaesung Hong, Juan Eugenio Iglesias, Carole H. Sudre, Archana Venkataraman, Polina Golland, Jong Hyo Kim, and Jinah Park, editors,Medical Image Computing and Computer Assisted Intervention ...

2025
[39]

A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence.Nature neuroscience, 25(1):116–126, 2022

Emily J Allen, Ghislain St-Yves, Yihan Wu, Jesse L Breedlove, Jacob S Prince, Logan T Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, et al. A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence.Nature neuroscience, 25(1):116–126, 2022

2022
[40]

BOLD5000: A public fMRI dataset while viewing 5000 visual images

Nadine Chang, John A Pyles, Austin Marcus, Abhinav Gupta, Michael J Tarr, and Elissa M Aminoff. BOLD5000: A public fMRI dataset while viewing 5000 visual images. InScientific Data, volume 6, page 49, 2019

2019
[41]

Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation

Jingyang Huo, Yikai Wang, Yun Wang, Xuelin Qian, Chong Li, Yanwei Fu, and Jianfeng Feng. Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation. InEuropean Conference on Computer Vision, pages 56–73. Springer, 2024

2024
[42]

BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, and Michal Irani. Brainexplore: Large-scale discovery of interpretable visual represen- tations in the human brain, 2025. URLhttps://arxiv.org/abs/2512.08560

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team. Gemma: Open models based on gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

C-pack: Packaged resources to advance general chinese embedding, 2023

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

2023
[46]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/ 2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014. 12

2014
[48]

A. T. Gifford, B. Lahner, S. Saba-Sadiya, M. G. Vilas, A. Lascelles, A. Oliva, K. Kay, G. Roig, and R. M. Cichy. The algonauts project 2023 challenge: How the human brain makes sense of natural scenes, 2023. URLhttps://arxiv.org/abs/2301.03198

work page arXiv 2023
[49]

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering . In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, Los Alamitos, CA, USA, July 2017. IEEE Computer Society. doi: 10.1109/CVPR.2017.670. URL...

work page doi:10.1109/cvpr.2017.670 2017
[50]

The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA)

Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. The color of the cat is gray: 1 million full-sentences visual question answering (fsvqa), 2016. URL https://arxiv.org/abs/ 1609.06657

work page internal anchor Pith review Pith/arXiv arXiv 2016
[51]

Saito, Z

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering . In2015 IEEE Inter- national Conference on Computer Vision (ICCV), pages 2425–2433, Los Alamitos, CA, USA, December 2015. IEEE Computer Society. doi: 10.1109/ICCV .2015.279. URL https://doi.ieeecomputersociet...

work page doi:10.1109/iccv 2015
[52]

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages =

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Comput...

work page doi:10.3115/1073083.1073135 2002
[53]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan,...

2005
[54]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URLhttps://aclanthology.org/W04-1013/

2004
[55]

Lawrence Zitnick, and Devi Parikh

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation . In2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, Los Alamitos, CA, USA, June 2015. IEEE Computer Society. doi: 10.1109/CVPR.2015.7299087. URL https://doi.ieeecomputersociety. org/10.1109/CVPR.2015.7299087

work page doi:10.1109/cvpr.2015.7299087 2015
[56]

Spice: Semantic propositional image caption evaluation

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. InECCV, 2016

2016
[57]

High-resolution image reconstruction with latent diffusion models from human brain activity

Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. InCVPR, 2023

2023
[58]

J. et al. Han. Onellm: One framework to align all modalities with language. InCVPR, 2024

2024
[59]

and Zhang, Z

W. Mai and Z. Zhang. Unibrain: Unify image reconstruction and captioning all in one diffusion model from human brain activity.arXiv preprint arXiv:2308.07428, 2023

work page arXiv 2023
[60]

Brain-language fusion enables interactive neural readout and in-silico experimenta- tion, 2025

Victoria Bosch, Daniel Anthes, Adrien Doerig, Sushrut Thorat, Peter König, and Tim Christian Kietzmann. Brain-language fusion enables interactive neural readout and in-silico experimenta- tion, 2025. URLhttps://arxiv.org/abs/2509.23941

work page arXiv 2025
[61]

Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini

James V Haxby, M. Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex.Science, 293(5539):2425–2430, 2001

2001
[62]

Beyond mind-reading: multi-voxel pattern analysis of fmri data.Trends in Cognitive Sciences, 10(9):424–430, 2006

Kenneth A Norman, Sean M Polyn, Greg J Detre, and James V Haxby. Beyond mind-reading: multi-voxel pattern analysis of fmri data.Trends in Cognitive Sciences, 10(9):424–430, 2006. 13

2006
[63]

Zeiler and Rob Fergus

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, volume 8689 ofLecture Notes in Computer Science, pages 818–833. Springer, 2014

2014
[64]

why should I trust you?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016

2016
[65]

Apurva Ratan Murty, and Nancy Kanwisher

Meenakshi Khosla, N. Apurva Ratan Murty, and Nancy Kanwisher. A highly selective response to food in human visual cortex revealed by hypothesis-free voxel decomposi- tion.Current Biology, 32(19):4159–4171.e9, 2022. ISSN 0960-9822. doi: https://doi.org/ 10.1016/j.cub.2022.08.009. URL https://www.sciencedirect.com/science/article/ pii/S0960982222012866

work page doi:10.1016/j.cub.2022.08.009 2022
[66]

Scaling instruction-finetuned language models, 2022

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, et al. Scaling instruction-finetuned language models, 2022

2022
[67]

objects". OUTPUT FORMAT (STRICT JSON ONLY): {

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=Bkg6RiCqY7. 14 Appendix A Limitation fMRI Assumptions.Following standard practice in the field, our model assumes that fMRI responses are memoryless and replicable. The memoryless assum...

2019

[1] [1]

Brain-it: Image reconstruction from fmri via brain-interaction transformer

Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, and Michal Irani. Brain-it: Image reconstruction from fmri via brain-interaction transformer. InInternational Conference on Learning Representations (ICLR), 2026

2026

[2] [2]

Variational autoencoder: An unsupervised model for encoding and decoding fmri activity in visual cortex.NeuroImage, 198:125–136, 2019

Kuan Han, Haiguang Wen, Junxing Shi, Kun-Han Lu, Yizhen Zhang, Di Fu, and Zhongming Liu. Variational autoencoder: An unsupervised model for encoding and decoding fmri activity in visual cortex.NeuroImage, 198:125–136, 2019

2019

[3] [3]

Dcnn-gan: Reconstructing realistic image from fmri

Yunfeng Lin, Jiangbei Li, and Hanjing Wang. Dcnn-gan: Reconstructing realistic image from fmri. In2019 16th International Conference on Machine Vision Applications (MVA), pages 1–6. IEEE, 2019

2019

[4] [4]

Reconstructing natural scenes from fmri patterns using bigbigan

Milad Mozafari, Leila Reddy, and Rufin VanRullen. Reconstructing natural scenes from fmri patterns using bigbigan. In2020 International joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2020

2020

[5] [5]

Biggan-based bayesian reconstruction of natural images from human brain activity.Neuroscience, 444:92–105, 2020

Kai Qiao, Jian Chen, Linyuan Wang, Chi Zhang, Li Tong, and Bin Yan. Biggan-based bayesian reconstruction of natural images from human brain activity.Neuroscience, 444:92–105, 2020

2020

[6] [6]

Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning.NeuroImage, 228:117602, 2021

Ziqi Ren, Jie Li, Xuetong Xue, Xin Li, Fan Yang, Zhicheng Jiao, and Xinbo Gao. Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning.NeuroImage, 228:117602, 2021

2021

[7] [7]

Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding

Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22710–22720, 2023

2023

[8] [8]

High-resolution image reconstruction with latent diffusion models from human brain activity

Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

2023

[9] [9]

Natural scene reconstruction from fmri signals using generative latent diffusion.Scientific Reports, 13(1):15666, 2023

Furkan Ozcelik and Rufin VanRullen. Natural scene reconstruction from fmri signals using generative latent diffusion.Scientific Reports, 13(1):15666, 2023

2023

[10] [10]

Deep image recon- struction from human brain activity.PLoS computational biology, 15(1):e1006633, 2019

Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image recon- struction from human brain activity.PLoS computational biology, 15(1):e1006633, 2019

2019

[11] [11]

Constraint-free natural image reconstruction from fmri signals based on convolutional neural network.Frontiers in human neuroscience, 12:242, 2018

Chi Zhang, Kai Qiao, Linyuan Wang, Li Tong, Ying Zeng, and Bin Yan. Constraint-free natural image reconstruction from fmri signals based on convolutional neural network.Frontiers in human neuroscience, 12:242, 2018

2018

[12] [12]

Mindbridge: A cross-subject brain decoding framework

Shizun Wang, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Mindbridge: A cross-subject brain decoding framework. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11333–11342, 2024

2024

[13] [13]

Umbrae: Unified multi- modal brain decoding

Weihao Xia, Raoul de Charette, Cengiz Oztireli, and Jing-Hao Xue. Umbrae: Unified multi- modal brain decoding. InEuropean Conference on Computer Vision, pages 242–259. Springer, 2024

2024

[14] [14]

SSM2Mel: State Space Model to Reconstruct Mel Spectrogram from the EEG,

Wanqiu Huang, Ke Ma, Tingyu Xie, and Hongwei Wang. Brainchat: Interactive semantic information decoding from fmri using large-scale vision-language pretrained models. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10889434. 10

work page doi:10.1109/icassp49660.2025.10889434 2025

[15] [15]

MindLLM: A subject-agnostic and versatile model for fMRI-to-text decoding

Weikang Qiu, Zheng Huang, Haoyu Hu, Aosong Feng, Yujun Yan, and Rex Ying. MindLLM: A subject-agnostic and versatile model for fMRI-to-text decoding. InForty-second Interna- tional Conference on Machine Learning, 2025. URL https://openreview.net/forum? id=EiAQrilPYP

2025

[16] [16]

M. et al. Ferrante. Brain captioning: Decoding human brain activity into images and text.arXiv preprint arXiv:2305.11560, 2023

work page arXiv 2023

[17] [17]

Unibrain: A unified model for cross-subject brain decoding.arXiv preprint arXiv:2412.19487, 2024

Zicheng Wang, Zhen Zhao, Luping Zhou, and Parashkev Nachev. Unibrain: A unified model for cross-subject brain decoding.arXiv preprint arXiv:2412.19487, 2024

work page arXiv 2024

[18] [18]

Mindgpt: Interpreting what you see with non-invasive brain recordings, 2023

Jiaxuan Chen, Yu Qi, Yueming Wang, and Gang Pan. Mindgpt: Interpreting what you see with non-invasive brain recordings, 2023. URLhttps://arxiv.org/abs/2309.15729

work page arXiv 2023

[19] [19]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

2022

[20] [20]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023

[21] [21]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[22] [22]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. Technical report, OpenAI, 2023. arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, et al. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023

[25] [25]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Wayne Xin Zhao, Ji-Rong Wang, Wenhao Wang, Chunyu Kit Chan, and Steven C.H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.arXiv preprint arXiv:2305.06500, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

Kendrick N Kay, Thomas Naselaris, Ryan J Prenger, and Jack L Gallant. Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

2008

[27] [27]

Bayesian reconstruction of natural images from human brain activity.Neuron, 63(6):902–915, 2009

Thomas Naselaris, Ryan J Prenger, Kendrick N Kay, Michael Oliver, and Jack L Gallant. Bayesian reconstruction of natural images from human brain activity.Neuron, 63(6):902–915, 2009

2009

[28] [28]

Reconstructing visual experiences from brain activity evoked by natural movies.Current biology, 21(19):1641–1646, 2011

Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gallant. Reconstructing visual experiences from brain activity evoked by natural movies.Current biology, 21(19):1641–1646, 2011

2011

[29] [29]

Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream.Journal of Neuroscience, 35 (27):10005–10014, 2015

Umut Güçlü and Marcel AJ Van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream.Journal of Neuroscience, 35 (27):10005–10014, 2015

2015

[30] [30]

Generative adversarial networks for reconstructing natural images from brain activity

Katja Seeliger, Umut Güçlü, Luca Ambrogioni, Yagmur Güçlütürk, and Marcel AJ Van Ger- ven. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage, 181:775–785, 2018

2018

[31] [31]

Generative adversarial networks conditioned on brain activity reconstruct seen images

Ghislain St-Yves and Thomas Naselaris. Generative adversarial networks conditioned on brain activity reconstruct seen images. In2018 IEEE international conference on systems, man, and cybernetics (SMC), pages 1054–1061. IEEE, 2018

2018

[32] [32]

From voxels to pixels and back: Self-supervision in natural-image reconstruction from fmri.Advances in Neural Information Processing Systems, 32:6514–6524, 2019

Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini, Tal Golan, and Michal Irani. From voxels to pixels and back: Self-supervision in natural-image reconstruction from fmri.Advances in Neural Information Processing Systems, 32:6514–6524, 2019. 11

2019

[33] [33]

S., Tripathy, M., Villanueva, C

Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data.arXiv preprint arXiv:2403.11207, 2024

work page arXiv 2024

[34] [34]

Mindtuner: Cross-subject visual decoding with visual fingerprint and semantic correction

Zixuan Gong, Qi Zhang, Guangyin Bao, Lei Zhu, Rongtao Xu, Ke Liu, Liang Hu, and Duoqian Miao. Mindtuner: Cross-subject visual decoding with visual fingerprint and semantic correction. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14247– 14255, 2025

2025

[35] [35]

Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2:1–21, 2024

Matteo Ferrante, Tommaso Boccato, Furkan Ozcelik, Rufin VanRullen, and Nicola Toschi. Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2:1–21, 2024

2024

[36] [36]

See through their minds: Learning transferable brain decoding models from cross-subject fmri

Yulong Liu, Yongqiang Ma, Guibo Zhu, Haodong Jing, and Nanning Zheng. See through their minds: Learning transferable brain decoding models from cross-subject fmri. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5730–5738, 2025

2025

[37] [37]

The Wisdom of a Crowd of Brains: A Universal Brain Encoder

Roman Beliy, Navve Wasserman, Amit Zalcher, and Michal Irani. The wisdom of a crowd of brains: A universal brain encoder.arXiv preprint arXiv:2406.12179, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Interpretable fmri captioning via contrastive learning

Vyacheslav Shen, Kassymzhomart Kunanbayev, Donggon Jang, and Daeshik Kim. Interpretable fmri captioning via contrastive learning. In James C. Gee, Daniel C. Alexander, Jaesung Hong, Juan Eugenio Iglesias, Carole H. Sudre, Archana Venkataraman, Polina Golland, Jong Hyo Kim, and Jinah Park, editors,Medical Image Computing and Computer Assisted Intervention ...

2025

[39] [39]

A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence.Nature neuroscience, 25(1):116–126, 2022

Emily J Allen, Ghislain St-Yves, Yihan Wu, Jesse L Breedlove, Jacob S Prince, Logan T Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, et al. A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence.Nature neuroscience, 25(1):116–126, 2022

2022

[40] [40]

BOLD5000: A public fMRI dataset while viewing 5000 visual images

Nadine Chang, John A Pyles, Austin Marcus, Abhinav Gupta, Michael J Tarr, and Elissa M Aminoff. BOLD5000: A public fMRI dataset while viewing 5000 visual images. InScientific Data, volume 6, page 49, 2019

2019

[41] [41]

Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation

Jingyang Huo, Yikai Wang, Yun Wang, Xuelin Qian, Chong Li, Yanwei Fu, and Jianfeng Feng. Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation. InEuropean Conference on Computer Vision, pages 56–73. Springer, 2024

2024

[42] [42]

BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, and Michal Irani. Brainexplore: Large-scale discovery of interpretable visual represen- tations in the human brain, 2025. URLhttps://arxiv.org/abs/2512.08560

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team. Gemma: Open models based on gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

C-pack: Packaged resources to advance general chinese embedding, 2023

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

2023

[46] [46]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/ 2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014. 12

2014

[48] [48]

A. T. Gifford, B. Lahner, S. Saba-Sadiya, M. G. Vilas, A. Lascelles, A. Oliva, K. Kay, G. Roig, and R. M. Cichy. The algonauts project 2023 challenge: How the human brain makes sense of natural scenes, 2023. URLhttps://arxiv.org/abs/2301.03198

work page arXiv 2023

[49] [49]

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering . In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, Los Alamitos, CA, USA, July 2017. IEEE Computer Society. doi: 10.1109/CVPR.2017.670. URL...

work page doi:10.1109/cvpr.2017.670 2017

[50] [50]

The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA)

Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. The color of the cat is gray: 1 million full-sentences visual question answering (fsvqa), 2016. URL https://arxiv.org/abs/ 1609.06657

work page internal anchor Pith review Pith/arXiv arXiv 2016

[51] [51]

Saito, Z

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering . In2015 IEEE Inter- national Conference on Computer Vision (ICCV), pages 2425–2433, Los Alamitos, CA, USA, December 2015. IEEE Computer Society. doi: 10.1109/ICCV .2015.279. URL https://doi.ieeecomputersociet...

work page doi:10.1109/iccv 2015

[52] [52]

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages =

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Comput...

work page doi:10.3115/1073083.1073135 2002

[53] [53]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan,...

2005

[54] [54]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URLhttps://aclanthology.org/W04-1013/

2004

[55] [55]

Lawrence Zitnick, and Devi Parikh

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation . In2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, Los Alamitos, CA, USA, June 2015. IEEE Computer Society. doi: 10.1109/CVPR.2015.7299087. URL https://doi.ieeecomputersociety. org/10.1109/CVPR.2015.7299087

work page doi:10.1109/cvpr.2015.7299087 2015

[56] [56]

Spice: Semantic propositional image caption evaluation

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. InECCV, 2016

2016

[57] [57]

High-resolution image reconstruction with latent diffusion models from human brain activity

Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. InCVPR, 2023

2023

[58] [58]

J. et al. Han. Onellm: One framework to align all modalities with language. InCVPR, 2024

2024

[59] [59]

and Zhang, Z

W. Mai and Z. Zhang. Unibrain: Unify image reconstruction and captioning all in one diffusion model from human brain activity.arXiv preprint arXiv:2308.07428, 2023

work page arXiv 2023

[60] [60]

Brain-language fusion enables interactive neural readout and in-silico experimenta- tion, 2025

Victoria Bosch, Daniel Anthes, Adrien Doerig, Sushrut Thorat, Peter König, and Tim Christian Kietzmann. Brain-language fusion enables interactive neural readout and in-silico experimenta- tion, 2025. URLhttps://arxiv.org/abs/2509.23941

work page arXiv 2025

[61] [61]

Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini

James V Haxby, M. Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex.Science, 293(5539):2425–2430, 2001

2001

[62] [62]

Beyond mind-reading: multi-voxel pattern analysis of fmri data.Trends in Cognitive Sciences, 10(9):424–430, 2006

Kenneth A Norman, Sean M Polyn, Greg J Detre, and James V Haxby. Beyond mind-reading: multi-voxel pattern analysis of fmri data.Trends in Cognitive Sciences, 10(9):424–430, 2006. 13

2006

[63] [63]

Zeiler and Rob Fergus

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, volume 8689 ofLecture Notes in Computer Science, pages 818–833. Springer, 2014

2014

[64] [64]

why should I trust you?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016

2016

[65] [65]

Apurva Ratan Murty, and Nancy Kanwisher

Meenakshi Khosla, N. Apurva Ratan Murty, and Nancy Kanwisher. A highly selective response to food in human visual cortex revealed by hypothesis-free voxel decomposi- tion.Current Biology, 32(19):4159–4171.e9, 2022. ISSN 0960-9822. doi: https://doi.org/ 10.1016/j.cub.2022.08.009. URL https://www.sciencedirect.com/science/article/ pii/S0960982222012866

work page doi:10.1016/j.cub.2022.08.009 2022

[66] [66]

Scaling instruction-finetuned language models, 2022

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, et al. Scaling instruction-finetuned language models, 2022

2022

[67] [67]

objects". OUTPUT FORMAT (STRICT JSON ONLY): {

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=Bkg6RiCqY7. 14 Appendix A Limitation fMRI Assumptions.Following standard practice in the field, our model assumes that fMRI responses are memoryless and replicable. The memoryless assum...

2019