pith. sign in

arxiv: 2605.29588 · v2 · pith:2IP6X7FXnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI· q-bio.NC

Brain-IT-VQA: From Brain Signals to Answers

Pith reviewed 2026-06-29 08:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AIq-bio.NC
keywords fMRI decodingvisual question answeringbrain signal to textNSD-VQA benchmarklanguage token decodingvisual information in brain responses
0
0 comments X

The pith

Brain-IT-VQA converts fMRI signals into language tokens that a language model uses to answer questions about viewed images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that reads fMRI brain scans taken while people view natural images and produces answers to questions about those images. It works by first turning brain activity into sequences of language tokens, then passing those tokens to a language model for final answers. This yields better performance than earlier fMRI captioning and VQA systems. The authors also release NSD-VQA, a benchmark that supplies roughly twenty controlled question-answer pairs for each image across twenty categories designed to separate different layers of visual understanding. With this benchmark they measure which kinds of visual and semantic content can be extracted reliably from fMRI responses and how different brain regions contribute to each kind of question.

Core claim

Brain-IT-VQA decodes language tokens from fMRI responses to natural images and integrates them with a language model to answer visual questions, substantially outperforming prior fMRI-based captioning and VQA methods. The NSD-VQA benchmark supplies on average twenty question-answer pairs per image across twenty controlled categories that separate levels of visual understanding, allowing quantification of which visual and semantic information is reliably decodable and analysis of brain-region contributions across question types.

What carries the argument

Brain Interaction Transformer that decodes language tokens directly from brain activity for use by a downstream language model.

If this is right

  • The method produces higher accuracy than earlier fMRI captioning and VQA systems on the new benchmark.
  • Quantification shows which visual and semantic features are reliably present in fMRI responses to natural images.
  • Brain-region contributions can be mapped separately for each of the twenty question categories.
  • The controlled question set enables more interpretable evaluation despite small fMRI test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-decoding step could be reused with other brain-recording methods such as EEG to test cross-modal consistency.
  • If token sequences prove stable across subjects, they might serve as a compact intermediate representation for brain-to-text systems.
  • The benchmark categories could be extended to test whether high-level semantic relations remain decodable when low-level visual features are controlled.
  • Performance differences across categories may indicate which visual computations occur early enough in the visual hierarchy to be captured by current fMRI resolution.

Load-bearing premise

The twenty controlled question categories in NSD-VQA truly separate distinct levels of visual understanding and the limited fMRI test data still supports stable performance comparisons.

What would settle it

A replication in which the new model shows no accuracy gain over prior fMRI VQA baselines on the NSD-VQA test set, or in which accuracy does not differ systematically across the twenty question categories in the expected pattern.

Figures

Figures reproduced from arXiv: 2605.29588 by Matias Cosarinsky, Michal Irani, Navve Wasserman, Oliver Heinimann, Roman Beliy.

Figure 1
Figure 1. Figure 1: fMRI-to-Language decoding: Captioning & VQA directly from fMRI brain activity. (a) Overview of the pipeline; fMRI signals recorded while a subject views an image are used to generate a caption or answer questions about the image. (b) Example captions on question-answer pairs generated by Brain-IT-VQA directly from fMRI signals. models remain limited in performance, and are not necessarily optimized for the… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Brain-IT-VQA architecture. which types of visual or semantic information are recoverable from neural signals. Recent work has broadened the evaluation scope: BrainHub [13] extends NSD with captioning and grounding tasks, and BrainChat [14] introduces fMRI question answering evaluated by classification accuracy on broad COCO VQA questions. However, neither provides controlled question catego… view at source ↗
Figure 3
Figure 3. Figure 3: NSD-VQA Dataset construction pipeline. Starting from NSD images, we generate structured annotations using a VLM, followed by filtering and verification. Template-based question generation then produces multiple question–answer pairs per image across controlled question categories. 3.3 Training Setup Training proceeds in two stages: In the first stage (BIT-L Pretraining), BIT-L is trained to predict two tar… view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of voxel-cluster marginal contributions across question categories. Different clusters show varying levels of importance depending on the question type (e.g., object, attribute, relation), highlighting how distinct brain regions support different aspects of visual and semantic processing. Technical Details: We estimate the marginal contribution of each voxel cluster to each question category … view at source ↗
read the original abstract

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Brain-IT-VQA, a framework extending the Brain Interaction Transformer to decode language tokens from fMRI signals and integrate them with a language model for visual question answering. It claims substantial outperformance over prior fMRI-based captioning and VQA methods. The work also presents the NSD-VQA benchmark, which supplies ~20 question-answer pairs per image across 20 controlled categories designed to disentangle visual and semantic levels, enabling more reliable evaluation and quantification of decodable information from natural-image fMRI responses, plus region-wise analyses.

Significance. If the performance gains and benchmark validity hold, the contribution would be a stronger predictive model for brain-signal decoding together with a controlled tool for probing the structure of visual representations in cortex.

major comments (2)
  1. [Abstract] Abstract: the central claim that Brain-IT-VQA 'substantially outperforms previous fMRI-based captioning and VQA approaches' is stated without any quantitative results, baselines, error bars, or statistical tests, rendering the outperformance assertion impossible to evaluate.
  2. [Abstract] Abstract (NSD-VQA description): the assertion that the 20 controlled categories 'disentangle multiple levels of visual understanding' and permit 'more reliable and interpretable evaluation despite limited fMRI test data' is load-bearing for the benchmark's utility, yet no supporting analysis (inter-question correlations, per-category variance, or power calculations) is referenced; this directly engages the concern that small test sets may render performance deltas and region analyses uninterpretable.
minor comments (1)
  1. [Abstract] The abstract states that the framework 'quantify which forms of visual and semantic information can be reliably decoded' and analyzes 'contributions of different brain regions across question types,' but supplies no methodological details on the decoding metrics, statistical tests, or region definitions used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and indicate planned revisions to strengthen the presentation of results and benchmark justification.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Brain-IT-VQA 'substantially outperforms previous fMRI-based captioning and VQA approaches' is stated without any quantitative results, baselines, error bars, or statistical tests, rendering the outperformance assertion impossible to evaluate.

    Authors: We agree that the abstract would be improved by including key quantitative support. The main text reports specific metrics (accuracy, BLEU, and CIDEr scores with error bars and statistical tests against prior methods in Tables 1-2 and Section 4.1). In revision we will add concise numerical results and baseline references to the abstract while keeping it within length limits. revision: yes

  2. Referee: [Abstract] Abstract (NSD-VQA description): the assertion that the 20 controlled categories 'disentangle multiple levels of visual understanding' and permit 'more reliable and interpretable evaluation despite limited fMRI test data' is load-bearing for the benchmark's utility, yet no supporting analysis (inter-question correlations, per-category variance, or power calculations) is referenced; this directly engages the concern that small test sets may render performance deltas and region analyses uninterpretable.

    Authors: The main text already provides per-category performance variance, region-wise contribution maps, and interpretability results (Sections 4.2-4.3) that support the benchmark's utility. We will revise the abstract to explicitly reference these analyses. We can also add a short note on inter-question correlations if the referee deems it necessary; power calculations for the test set size can be included in the supplement. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces Brain-IT-VQA by extending the prior Brain-IT model and presents NSD-VQA as a new benchmark whose 20 controlled categories are asserted to disentangle visual understanding levels. Performance gains and region analyses are framed as empirical outcomes from applying the decoder to fMRI data, with no equations, fitted-parameter renamings, or self-citation chains shown that would reduce the central claims to inputs by construction. The derivation remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms, or invented entities; ledger is empty.

pith-pipeline@v0.9.1-grok · 5814 in / 1134 out tokens · 34141 ms · 2026-06-29T08:18:43.018211+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 22 canonical work pages · 9 internal anchors

  1. [1]

    Brain-it: Image reconstruction from fmri via brain-interaction transformer

    Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, and Michal Irani. Brain-it: Image reconstruction from fmri via brain-interaction transformer. InInternational Conference on Learning Representations (ICLR), 2026

  2. [2]

    Variational autoencoder: An unsupervised model for encoding and decoding fmri activity in visual cortex.NeuroImage, 198:125–136, 2019

    Kuan Han, Haiguang Wen, Junxing Shi, Kun-Han Lu, Yizhen Zhang, Di Fu, and Zhongming Liu. Variational autoencoder: An unsupervised model for encoding and decoding fmri activity in visual cortex.NeuroImage, 198:125–136, 2019

  3. [3]

    Dcnn-gan: Reconstructing realistic image from fmri

    Yunfeng Lin, Jiangbei Li, and Hanjing Wang. Dcnn-gan: Reconstructing realistic image from fmri. In2019 16th International Conference on Machine Vision Applications (MVA), pages 1–6. IEEE, 2019

  4. [4]

    Reconstructing natural scenes from fmri patterns using bigbigan

    Milad Mozafari, Leila Reddy, and Rufin VanRullen. Reconstructing natural scenes from fmri patterns using bigbigan. In2020 International joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2020

  5. [5]

    Biggan-based bayesian reconstruction of natural images from human brain activity.Neuroscience, 444:92–105, 2020

    Kai Qiao, Jian Chen, Linyuan Wang, Chi Zhang, Li Tong, and Bin Yan. Biggan-based bayesian reconstruction of natural images from human brain activity.Neuroscience, 444:92–105, 2020

  6. [6]

    Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning.NeuroImage, 228:117602, 2021

    Ziqi Ren, Jie Li, Xuetong Xue, Xin Li, Fan Yang, Zhicheng Jiao, and Xinbo Gao. Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning.NeuroImage, 228:117602, 2021

  7. [7]

    Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding

    Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22710–22720, 2023

  8. [8]

    High-resolution image reconstruction with latent diffusion models from human brain activity

    Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

  9. [9]

    Natural scene reconstruction from fmri signals using generative latent diffusion.Scientific Reports, 13(1):15666, 2023

    Furkan Ozcelik and Rufin VanRullen. Natural scene reconstruction from fmri signals using generative latent diffusion.Scientific Reports, 13(1):15666, 2023

  10. [10]

    Deep image recon- struction from human brain activity.PLoS computational biology, 15(1):e1006633, 2019

    Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image recon- struction from human brain activity.PLoS computational biology, 15(1):e1006633, 2019

  11. [11]

    Constraint-free natural image reconstruction from fmri signals based on convolutional neural network.Frontiers in human neuroscience, 12:242, 2018

    Chi Zhang, Kai Qiao, Linyuan Wang, Li Tong, Ying Zeng, and Bin Yan. Constraint-free natural image reconstruction from fmri signals based on convolutional neural network.Frontiers in human neuroscience, 12:242, 2018

  12. [12]

    Mindbridge: A cross-subject brain decoding framework

    Shizun Wang, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Mindbridge: A cross-subject brain decoding framework. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11333–11342, 2024

  13. [13]

    Umbrae: Unified multi- modal brain decoding

    Weihao Xia, Raoul de Charette, Cengiz Oztireli, and Jing-Hao Xue. Umbrae: Unified multi- modal brain decoding. InEuropean Conference on Computer Vision, pages 242–259. Springer, 2024

  14. [14]

    SSM2Mel: State Space Model to Reconstruct Mel Spectrogram from the EEG,

    Wanqiu Huang, Ke Ma, Tingyu Xie, and Hongwei Wang. Brainchat: Interactive semantic information decoding from fmri using large-scale vision-language pretrained models. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10889434. 10

  15. [15]

    MindLLM: A subject-agnostic and versatile model for fMRI-to-text decoding

    Weikang Qiu, Zheng Huang, Haoyu Hu, Aosong Feng, Yujun Yan, and Rex Ying. MindLLM: A subject-agnostic and versatile model for fMRI-to-text decoding. InForty-second Interna- tional Conference on Machine Learning, 2025. URL https://openreview.net/forum? id=EiAQrilPYP

  16. [16]

    M. et al. Ferrante. Brain captioning: Decoding human brain activity into images and text.arXiv preprint arXiv:2305.11560, 2023

  17. [17]

    Unibrain: A unified model for cross-subject brain decoding.arXiv preprint arXiv:2412.19487, 2024

    Zicheng Wang, Zhen Zhao, Luping Zhou, and Parashkev Nachev. Unibrain: A unified model for cross-subject brain decoding.arXiv preprint arXiv:2412.19487, 2024

  18. [18]

    Mindgpt: Interpreting what you see with non-invasive brain recordings, 2023

    Jiaxuan Chen, Yu Qi, Yueming Wang, and Gang Pan. Mindgpt: Interpreting what you see with non-invasive brain recordings, 2023. URLhttps://arxiv.org/abs/2309.15729

  19. [19]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

  20. [20]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  21. [21]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  22. [22]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. Technical report, OpenAI, 2023. arXiv:2303.08774

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, et al. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  24. [24]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  25. [25]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Wayne Xin Zhao, Ji-Rong Wang, Wenhao Wang, Chunyu Kit Chan, and Steven C.H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.arXiv preprint arXiv:2305.06500, 2023

  26. [26]

    Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

    Kendrick N Kay, Thomas Naselaris, Ryan J Prenger, and Jack L Gallant. Identifying natural images from human brain activity.Nature, 452(7185):352–355, 2008

  27. [27]

    Bayesian reconstruction of natural images from human brain activity.Neuron, 63(6):902–915, 2009

    Thomas Naselaris, Ryan J Prenger, Kendrick N Kay, Michael Oliver, and Jack L Gallant. Bayesian reconstruction of natural images from human brain activity.Neuron, 63(6):902–915, 2009

  28. [28]

    Reconstructing visual experiences from brain activity evoked by natural movies.Current biology, 21(19):1641–1646, 2011

    Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gallant. Reconstructing visual experiences from brain activity evoked by natural movies.Current biology, 21(19):1641–1646, 2011

  29. [29]

    Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream.Journal of Neuroscience, 35 (27):10005–10014, 2015

    Umut Güçlü and Marcel AJ Van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream.Journal of Neuroscience, 35 (27):10005–10014, 2015

  30. [30]

    Generative adversarial networks for reconstructing natural images from brain activity

    Katja Seeliger, Umut Güçlü, Luca Ambrogioni, Yagmur Güçlütürk, and Marcel AJ Van Ger- ven. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage, 181:775–785, 2018

  31. [31]

    Generative adversarial networks conditioned on brain activity reconstruct seen images

    Ghislain St-Yves and Thomas Naselaris. Generative adversarial networks conditioned on brain activity reconstruct seen images. In2018 IEEE international conference on systems, man, and cybernetics (SMC), pages 1054–1061. IEEE, 2018

  32. [32]

    From voxels to pixels and back: Self-supervision in natural-image reconstruction from fmri.Advances in Neural Information Processing Systems, 32:6514–6524, 2019

    Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini, Tal Golan, and Michal Irani. From voxels to pixels and back: Self-supervision in natural-image reconstruction from fmri.Advances in Neural Information Processing Systems, 32:6514–6524, 2019. 11

  33. [33]

    S., Tripathy, M., Villanueva, C

    Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data.arXiv preprint arXiv:2403.11207, 2024

  34. [34]

    Mindtuner: Cross-subject visual decoding with visual fingerprint and semantic correction

    Zixuan Gong, Qi Zhang, Guangyin Bao, Lei Zhu, Rongtao Xu, Ke Liu, Liang Hu, and Duoqian Miao. Mindtuner: Cross-subject visual decoding with visual fingerprint and semantic correction. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14247– 14255, 2025

  35. [35]

    Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2:1–21, 2024

    Matteo Ferrante, Tommaso Boccato, Furkan Ozcelik, Rufin VanRullen, and Nicola Toschi. Through their eyes: multi-subject brain decoding with simple alignment techniques.Imaging Neuroscience, 2:1–21, 2024

  36. [36]

    See through their minds: Learning transferable brain decoding models from cross-subject fmri

    Yulong Liu, Yongqiang Ma, Guibo Zhu, Haodong Jing, and Nanning Zheng. See through their minds: Learning transferable brain decoding models from cross-subject fmri. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5730–5738, 2025

  37. [37]

    The Wisdom of a Crowd of Brains: A Universal Brain Encoder

    Roman Beliy, Navve Wasserman, Amit Zalcher, and Michal Irani. The wisdom of a crowd of brains: A universal brain encoder.arXiv preprint arXiv:2406.12179, 2024

  38. [38]

    Interpretable fmri captioning via contrastive learning

    Vyacheslav Shen, Kassymzhomart Kunanbayev, Donggon Jang, and Daeshik Kim. Interpretable fmri captioning via contrastive learning. In James C. Gee, Daniel C. Alexander, Jaesung Hong, Juan Eugenio Iglesias, Carole H. Sudre, Archana Venkataraman, Polina Golland, Jong Hyo Kim, and Jinah Park, editors,Medical Image Computing and Computer Assisted Intervention ...

  39. [39]

    A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence.Nature neuroscience, 25(1):116–126, 2022

    Emily J Allen, Ghislain St-Yves, Yihan Wu, Jesse L Breedlove, Jacob S Prince, Logan T Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, et al. A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence.Nature neuroscience, 25(1):116–126, 2022

  40. [40]

    BOLD5000: A public fMRI dataset while viewing 5000 visual images

    Nadine Chang, John A Pyles, Austin Marcus, Abhinav Gupta, Michael J Tarr, and Elissa M Aminoff. BOLD5000: A public fMRI dataset while viewing 5000 visual images. InScientific Data, volume 6, page 49, 2019

  41. [41]

    Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation

    Jingyang Huo, Yikai Wang, Yun Wang, Xuelin Qian, Chong Li, Yanwei Fu, and Jianfeng Feng. Neuropictor: Refining fmri-to-image reconstruction via multi-individual pretraining and multi-level modulation. InEuropean Conference on Computer Vision, pages 56–73. Springer, 2024

  42. [42]

    BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

    Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, and Michal Irani. Brainexplore: Large-scale discovery of interpretable visual represen- tations in the human brain, 2025. URLhttps://arxiv.org/abs/2512.08560

  43. [43]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

  44. [44]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team. Gemma: Open models based on gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295

  45. [45]

    C-pack: Packaged resources to advance general chinese embedding, 2023

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

  46. [46]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/ 2407.21783

  47. [47]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014. 12

  48. [48]

    A. T. Gifford, B. Lahner, S. Saba-Sadiya, M. G. Vilas, A. Lascelles, A. Oliva, K. Kay, G. Roig, and R. M. Cichy. The algonauts project 2023 challenge: How the human brain makes sense of natural scenes, 2023. URLhttps://arxiv.org/abs/2301.03198

  49. [49]

    Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering . In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, Los Alamitos, CA, USA, July 2017. IEEE Computer Society. doi: 10.1109/CVPR.2017.670. URL...

  50. [50]

    The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA)

    Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. The color of the cat is gray: 1 million full-sentences visual question answering (fsvqa), 2016. URL https://arxiv.org/abs/ 1609.06657

  51. [51]

    Saito, Z

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering . In2015 IEEE Inter- national Conference on Computer Vision (ICCV), pages 2425–2433, Los Alamitos, CA, USA, December 2015. IEEE Computer Society. doi: 10.1109/ICCV .2015.279. URL https://doi.ieeecomputersociet...

  52. [52]

    Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages =

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Comput...

  53. [53]

    METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan,...

  54. [54]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URLhttps://aclanthology.org/W04-1013/

  55. [55]

    Lawrence Zitnick, and Devi Parikh

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation . In2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, Los Alamitos, CA, USA, June 2015. IEEE Computer Society. doi: 10.1109/CVPR.2015.7299087. URL https://doi.ieeecomputersociety. org/10.1109/CVPR.2015.7299087

  56. [56]

    Spice: Semantic propositional image caption evaluation

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. InECCV, 2016

  57. [57]

    High-resolution image reconstruction with latent diffusion models from human brain activity

    Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. InCVPR, 2023

  58. [58]

    J. et al. Han. Onellm: One framework to align all modalities with language. InCVPR, 2024

  59. [59]

    and Zhang, Z

    W. Mai and Z. Zhang. Unibrain: Unify image reconstruction and captioning all in one diffusion model from human brain activity.arXiv preprint arXiv:2308.07428, 2023

  60. [60]

    Brain-language fusion enables interactive neural readout and in-silico experimenta- tion, 2025

    Victoria Bosch, Daniel Anthes, Adrien Doerig, Sushrut Thorat, Peter König, and Tim Christian Kietzmann. Brain-language fusion enables interactive neural readout and in-silico experimenta- tion, 2025. URLhttps://arxiv.org/abs/2509.23941

  61. [61]

    Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini

    James V Haxby, M. Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex.Science, 293(5539):2425–2430, 2001

  62. [62]

    Beyond mind-reading: multi-voxel pattern analysis of fmri data.Trends in Cognitive Sciences, 10(9):424–430, 2006

    Kenneth A Norman, Sean M Polyn, Greg J Detre, and James V Haxby. Beyond mind-reading: multi-voxel pattern analysis of fmri data.Trends in Cognitive Sciences, 10(9):424–430, 2006. 13

  63. [63]

    Zeiler and Rob Fergus

    Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, volume 8689 ofLecture Notes in Computer Science, pages 818–833. Springer, 2014

  64. [64]

    why should I trust you?

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016

  65. [65]

    Apurva Ratan Murty, and Nancy Kanwisher

    Meenakshi Khosla, N. Apurva Ratan Murty, and Nancy Kanwisher. A highly selective response to food in human visual cortex revealed by hypothesis-free voxel decomposi- tion.Current Biology, 32(19):4159–4171.e9, 2022. ISSN 0960-9822. doi: https://doi.org/ 10.1016/j.cub.2022.08.009. URL https://www.sciencedirect.com/science/article/ pii/S0960982222012866

  66. [66]

    Scaling instruction-finetuned language models, 2022

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, et al. Scaling instruction-finetuned language models, 2022

  67. [67]

    objects". OUTPUT FORMAT (STRICT JSON ONLY): {

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=Bkg6RiCqY7. 14 Appendix A Limitation fMRI Assumptions.Following standard practice in the field, our model assumes that fMRI responses are memoryless and replicable. The memoryless assum...