Recognition: unknown
A foundation model of vision, audition, and language for in-silico neuroscience
Pith reviewed 2026-05-08 17:02 UTC · model grok-4.3
The pith
A single tri-modal foundation model trained on over 1,000 hours of fMRI data predicts brain responses for novel stimuli, tasks, and subjects while recovering established neuroscience findings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRIBE v2, a tri-modal foundation model, accurately predicts high-resolution brain responses for novel stimuli, tasks and subjects, superseding traditional linear encoding models and delivering several-fold improvements in accuracy. Critically, it enables in silico experimentation by recovering a variety of results from seminal visual and neuro-linguistic paradigms established by decades of empirical research. Extracting interpretable latent features from the model reveals the fine-grained topography of multisensory integration and establishes artificial intelligence as a unifying framework for exploring the functional organization of the human brain.
What carries the argument
The TRIBE v2 tri-modal foundation model, which processes combined video, audio, and language inputs to generate predictions of fMRI-measured brain activity and extract latent features that map multisensory integration.
Load-bearing premise
A single model trained on the collected naturalistic and experimental fMRI scans will generalize to truly new stimuli, tasks, and subjects without overfitting or circular evaluation.
What would settle it
A test set of new stimuli or tasks where the model fails to predict brain activity better than linear baselines or fails to recover the established patterns from classic visual and language experiments.
read the original abstract
Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses for novel stimuli, tasks and subjects, superseding traditional linear encoding models, delivering several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent features, TRIBE v2 reveals the fine-grained topography of multisensory integration. These results establish artificial intelligence as a unifying framework for exploring the functional organization of the human brain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TRIBE v2, a tri-modal (video, audio, language) foundation model trained on over 1,000 hours of fMRI data from 720 subjects. It claims to predict high-resolution brain responses for novel stimuli, tasks, and subjects with several-fold accuracy gains over traditional linear encoding models, recover established empirical results from seminal visual and neuro-linguistic paradigms via in-silico testing, and reveal fine-grained multisensory integration topography through interpretable latent features.
Significance. If the generalization and recovery claims hold after proper validation, this would represent a substantial advance toward unified models in cognitive neuroscience, potentially enabling scalable in-silico experimentation that reduces reliance on new empirical data collection and integrates fragmented paradigm-specific approaches.
major comments (3)
- [Abstract] Abstract: The headline claim of 'several-fold improvements in accuracy' over linear encoding models and accurate prediction 'for novel stimuli, tasks and subjects' is presented without any quantitative metrics (e.g., correlation coefficients, R² values), specific baselines, statistical tests, or controls for subject/stimulus overlap.
- [Abstract] Abstract and Methods: No description is given of the cross-validation or hold-out protocol (subject-wise, stimulus-wise, or task-wise splits), leaving the generalization claims vulnerable to potential data leakage or within-distribution interpolation rather than true out-of-distribution performance.
- [Abstract] Abstract: The assertion that TRIBE v2 'recovers a variety of results established by decades of empirical research' on visual and neuro-linguistic paradigms lacks any specifics on which paradigms were tested, which exact results were recovered, or quantitative measures of recovery fidelity.
minor comments (2)
- [Abstract] The abstract uses 'in-silico' with a hyphen; standardize to the conventional 'in silico' throughout.
- [Results] The manuscript would benefit from explicit comparison tables or figures showing performance against linear baselines on the same held-out data.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our results. We address each major comment point by point below and indicate where revisions will be made to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline claim of 'several-fold improvements in accuracy' over linear encoding models and accurate prediction 'for novel stimuli, tasks and subjects' is presented without any quantitative metrics (e.g., correlation coefficients, R² values), specific baselines, statistical tests, or controls for subject/stimulus overlap.
Authors: We agree that the abstract would be strengthened by including representative quantitative metrics. The main text reports specific correlation coefficients, R² values, statistical tests, and controls for subject/stimulus overlap when comparing TRIBE v2 to linear encoding models. We will revise the abstract to incorporate key quantitative values and name the primary baselines, while preserving brevity. revision: yes
-
Referee: [Abstract] Abstract and Methods: No description is given of the cross-validation or hold-out protocol (subject-wise, stimulus-wise, or task-wise splits), leaving the generalization claims vulnerable to potential data leakage or within-distribution interpolation rather than true out-of-distribution performance.
Authors: The Methods section details the cross-validation and hold-out protocols, including subject-wise splits for novel subjects and stimulus-wise splits for novel stimuli. These were designed to support out-of-distribution generalization. We will add a concise summary of the validation protocol to the abstract to address this concern directly. revision: yes
-
Referee: [Abstract] Abstract: The assertion that TRIBE v2 'recovers a variety of results established by decades of empirical research' on visual and neuro-linguistic paradigms lacks any specifics on which paradigms were tested, which exact results were recovered, or quantitative measures of recovery fidelity.
Authors: The Results section specifies the paradigms tested (e.g., visual retinotopy and category selectivity; linguistic syntactic processing) and provides quantitative comparisons of recovery fidelity to empirical findings. We will revise the abstract to name the key paradigms and indicate the quantitative fidelity of the in-silico recoveries. revision: yes
Circularity Check
No circularity: empirical training on held-out fMRI data with external validation
full rationale
The paper trains a tri-modal foundation model on a large multi-subject fMRI corpus and evaluates its ability to predict responses to novel stimuli/tasks/subjects while recovering known empirical findings. No equations, self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations are present that would make any claimed prediction equivalent to its training inputs by construction. The derivation chain is self-contained: model parameters are learned from data, performance is measured on separate test conditions, and recovery of prior results functions as independent validation rather than renaming or smuggling ansatzes.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Nature , volume=
Identifying natural images from human brain activity , author=. Nature , volume=. 2008 , publisher=
2008
-
[2]
Nature Neuroscience , volume=
Semantic reconstruction of continuous language from non-invasive brain recordings , author=. Nature Neuroscience , volume=
-
[3]
arXiv preprint arXiv:2401.09918 , year=
CLIP-Decoding: Toward Human-Level Visual Decoding through Augmented Training , author=. arXiv preprint arXiv:2401.09918 , year=
-
[4]
arXiv preprint arXiv:2402.12345 , year=
Toward a Human-Like Visual System: Decoding and Calibrating High-Level Semantic Representations from Brain Activity , author=. arXiv preprint arXiv:2402.12345 , year=
-
[5]
The wisdom of a crowd of brains: A universal brain encoder.arXiv preprint arXiv:2406.12179, 2024
The wisdom of a crowd of brains: A universal brain encoder , author=. arXiv preprint arXiv:2406.12179 , year=
-
[6]
Behavioral and brain sciences , volume=
The weirdest people in the world? , author=. Behavioral and brain sciences , volume=. 2010 , publisher=
2010
-
[7]
Nature Communications , volume=
Towards decoding individual words from non-invasive brain recordings , author=. Nature Communications , volume=. 2025 , publisher=
2025
-
[8]
Annual review of psychology , volume=
Speech computations of the human superior temporal gyrus , author=. Annual review of psychology , volume=. 2022 , publisher=
2022
-
[9]
Nature human behaviour , volume=
Language, mind and brain , author=. Nature human behaviour , volume=. 2017 , publisher=
2017
-
[10]
Cell , volume=
The code for facial identity in the primate brain , author=. Cell , volume=. 2017 , publisher=
2017
-
[11]
Scaling laws for decoding images from brain activity.arXiv preprint arXiv:2501.15322, 2025
Scaling laws for decoding images from brain activity , author=. arXiv preprint arXiv:2501.15322 , year=
-
[12]
arXiv preprint arXiv:2310.19812 , year=
Brain decoding: toward real-time reconstruction of visual perception , author=. arXiv preprint arXiv:2310.19812 , year=
-
[13]
Proceedings of the National Academy of Sciences , volume=
Cortical representation of the constituent structure of sentences , author=. Proceedings of the National Academy of Sciences , volume=. 2011 , publisher=
2011
-
[14]
Advances in Neural Information Processing Systems , volume=
A unified, scalable framework for neural population decoding , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
Nature , volume=
Vicarious body maps bridge vision and touch in the human brain , author=. Nature , volume=
-
[16]
bioRxiv , pages=
One hundred neural networks and brains watching videos: Lessons from alignment , author=. bioRxiv , pages=. 2024 , publisher=
2024
-
[17]
bioRxiv , pages=
MOSAIC: A scalable framework for fMRI dataset aggregation and modeling of human vision , author=. bioRxiv , pages=. 2025 , publisher=
2025
-
[18]
Proceedings of the national academy of sciences , volume=
Consistent resting-state networks across healthy subjects , author=. Proceedings of the national academy of sciences , volume=. 2006 , publisher=
2006
-
[19]
Nature Human Behaviour , volume=
In silico discovery of representational relationships across visual cortex , author=. Nature Human Behaviour , volume=. 2025 , publisher=
2025
-
[20]
Neuroimage , volume=
FSL , author=. Neuroimage , volume=. 2012 , publisher=
2012
-
[21]
American Journal of Psychiatry , volume=
Structural brain magnetic resonance imaging of limbic and thalamic volumes in pediatric bipolar disorder , author=. American Journal of Psychiatry , volume=. 2005 , publisher=
2005
-
[22]
Schizophrenia research , volume=
Decreased volume of left and total anterior insular lobule in schizophrenia , author=. Schizophrenia research , volume=. 2006 , publisher=
2006
-
[23]
Neuroimage , volume=
An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest , author=. Neuroimage , volume=. 2006 , publisher=
2006
-
[24]
Human brain mapping , volume=
Optimal experimental design for event-related fMRI , author=. Human brain mapping , volume=. 1999 , publisher=
1999
-
[25]
Nature , volume=
A multi-modal parcellation of human cerebral cortex , author=. Nature , volume=. 2016 , publisher=
2016
-
[26]
Proceedings of the National Academy of Sciences , volume=
Situating the default-mode network along a principal gradient of macroscale cortical organization , author=. Proceedings of the National Academy of Sciences , volume=. 2016 , publisher=
2016
-
[27]
Language, cognition and neuroscience , volume=
The revolution will not be controlled: natural stimuli in speech neuroscience , author=. Language, cognition and neuroscience , volume=. 2020 , publisher=
2020
-
[28]
Nature Reviews Neuroscience , volume=
The language network as a natural kind within the broader landscape of the human brain , author=. Nature Reviews Neuroscience , volume=. 2024 , publisher=
2024
-
[29]
Neuron , volume=
A common, high-dimensional model of the representational space in human ventral temporal cortex , author=. Neuron , volume=. 2011 , publisher=
2011
-
[30]
Neuron , volume=
A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy , author=. Neuron , volume=. 2018 , publisher=
2018
-
[31]
A foundation model to predict and capture human cognition.Nature, 644: 1002–1009, 2025
Binz, Marcel and Akata, Elif and Bethge, Matthias and Br. A foundation model to predict and capture human cognition , journal =. 2025 , volume =. doi:10.1038/s41586-025-09215-4 , publisher =
-
[32]
Frontiers in Computational Neuroscience , volume=
Artificial neural networks as models of neural information processing , author=. Frontiers in Computational Neuroscience , volume=. 2017 , publisher=
2017
-
[33]
Annual review of vision science , volume=
Deep neural networks: a new framework for modeling biological vision and brain information processing , author=. Annual review of vision science , volume=. 2015 , publisher=
2015
-
[34]
science , volume=
Predicting human brain activity associated with the meanings of nouns , author=. science , volume=. 2008 , publisher=
2008
-
[35]
Proceedings of the national academy of sciences , volume=
Performance-optimized hierarchical models predict neural responses in higher visual cortex , author=. Proceedings of the national academy of sciences , volume=. 2014 , publisher=
2014
-
[36]
Nature neuroscience , volume=
A deep learning framework for neuroscience , author=. Nature neuroscience , volume=. 2019 , publisher=
2019
-
[37]
Nature neuroscience , volume=
Using goal-driven deep learning models to understand sensory cortex , author=. Nature neuroscience , volume=. 2016 , publisher=
2016
-
[38]
Cell , volume=
Decoding the brain: From neural representations to mechanistic models , author=. Cell , volume=. 2024 , publisher=
2024
-
[39]
Nature , volume=
Foundation model of neural activity predicts response to new stimulus types , author=. Nature , volume=. 2025 , publisher=
2025
-
[40]
Fourteenth Critical Assessment of Techniques for Protein Structure Prediction , volume=
AlphaFold 2 , author=. Fourteenth Critical Assessment of Techniques for Protein Structure Prediction , volume=. 2020 , publisher=
2020
-
[41]
arXiv preprint arXiv:2411.11783 , year=
Open catalyst experiments 2024 (OCx24): bridging experiments and computational models , author=. arXiv preprint arXiv:2411.11783 , year=
-
[42]
Scientific data , volume=
Individual Brain Charting, a high-resolution fMRI dataset for cognitive mapping , author=. Scientific data , volume=. 2018 , publisher=
2018
-
[43]
elife , volume=
Hyperalignment: Modeling shared information encoded in idiosyncratic cortical topographies , author=. elife , volume=. 2020 , publisher=
2020
-
[44]
Entropy , volume=
Explainable ai: A review of machine learning interpretability methods , author=. Entropy , volume=. 2020 , publisher=
2020
-
[45]
Nature reviews neuroscience , volume=
Scanning the horizon: towards transparent and reproducible neuroimaging research , author=. Nature reviews neuroscience , volume=. 2017 , publisher=
2017
-
[46]
arXiv preprint arXiv:2501.00504 , year=
The algonauts project 2025 challenge: How the human brain makes sense of multimodal movies , author=. arXiv preprint arXiv:2501.00504 , year=
-
[47]
Scaling Laws for Neural Language Models
Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=
work page internal anchor Pith review arXiv 2001
-
[48]
The Fourteenth International Conference on Learning Representations , year=
TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction , author=. The Fourteenth International Conference on Learning Representations , year=
-
[49]
Narratives
The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehension , author=. Scientific data , volume=. 2021 , publisher=
2021
-
[50]
Scientific data , volume=
Le Petit Prince multilingual naturalistic fMRI corpus , author=. Scientific data , volume=. 2022 , publisher=
2022
-
[51]
Scientific data , volume=
A naturalistic neuroimaging database for understanding the brain using ecological stimuli , author=. Scientific data , volume=. 2020 , publisher=
2020
-
[52]
Imaging Neuroscience , volume=
Neurosynth compose: A Web-Based platform for flexible and reproducible neuroimaging Meta-Analysis , author=. Imaging Neuroscience , volume=. 2026 , publisher=
2026
-
[53]
Nature communications , volume=
Modeling short visual events through the BOLD moments video fMRI dataset and metadata , author=. Nature communications , volume=. 2024 , publisher=
2024
-
[54]
LoRA: Low-Rank Adaptation of Large Language Models
Lora: Low-rank adaptation of large language models. arXiv 2021 , author=. arXiv preprint arXiv:2106.09685 , volume=
work page internal anchor Pith review arXiv 2021
-
[55]
Neuroscience , volume=
Toward coordinate-based cognition dictionaries: A BrainMap and neurosynth demo , author=. Neuroscience , volume=. 2022 , publisher=
2022
-
[56]
Cerebral cortex , volume=
Neural encoding and decoding with deep learning for dynamic natural vision , author=. Cerebral cortex , volume=. 2018 , publisher=
2018
-
[57]
Neuroimage , volume=
The WU-Minn human connectome project: an overview , author=. Neuroimage , volume=. 2013 , publisher=
2013
-
[58]
arXiv preprint arXiv:2507.17958 , year=
VIBE: Video-Input Brain Encoder for fMRI Response Modeling , author=. arXiv preprint arXiv:2507.17958 , year=
-
[59]
arXiv preprint arXiv:2507.17897 , year=
Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025) , author=. arXiv preprint arXiv:2507.17897 , year=
-
[60]
arXiv preprint arXiv:2507.19956 , year=
Predicting Brain Responses To Natural Movies With Multimodal LLMs , author=. arXiv preprint arXiv:2507.19956 , year=
-
[61]
arXiv preprint arXiv:2507.18104 , year=
A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli , author=. arXiv preprint arXiv:2507.18104 , year=
-
[62]
2023 , doi =
Adeli, Hossein and Minni, Sun and Kriegeskorte, Nikolaus , title =. 2023 , doi =. https://www.biorxiv.org/content/early/2023/08/05/2023.08.02.551743.full.pdf , journal =
2023
-
[63]
arXiv preprint arXiv:2308.00262 , year=
The algonauts project 2023 challenge: Uark-ualbany team solution , author=. arXiv preprint arXiv:2308.00262 , year=
-
[64]
arXiv preprint arXiv:2308.01175 , year=
Memory encoding model , author=. arXiv preprint arXiv:2308.01175 , year=
-
[65]
Averaging weights leads to wider optima and better generalization , author=. arXiv preprint arXiv:1803.05407 , year=
-
[66]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review arXiv
-
[67]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning , author=. arXiv preprint arXiv:2506.09985 , year=
work page internal anchor Pith review arXiv
-
[68]
Journal of Consumer Behaviour: An International Research Review , volume=
Neuroethics of neuromarketing , author=. Journal of Consumer Behaviour: An International Research Review , volume=. 2008 , publisher=
2008
-
[69]
Perceiver IO: A general architecture for structured inputs & outputs.Preprint arXiv:2107.14795,
Jaegle, Andrew and Borgeaud, Sebastian and Alayrac, Jean-Baptiste and Doersch, Carl and Ionescu, Catalin and Ding, David and Koppula, Skanda and Zoran, Daniel and Brock, Andrew and Shelhamer, Evan and H. arXiv , year =. 2107.14795 , doi =
-
[70]
Abdin, Marah and Aneja, Jyoti and Behl, Harkirat and Bubeck, S. arXiv , year =. 2412.08905 , doi =
work page internal anchor Pith review arXiv
-
[71]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Srivastava, Siddharth and Sharma, Gaurav , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =
2024
-
[72]
NeuroImage , volume=
BOLD hemodynamic response function changes significantly with healthy aging , author=. NeuroImage , volume=. 2019 , publisher=
2019
-
[73]
Journal of Cerebral Blood Flow & Metabolism , volume=
The hemodynamic impulse response to a single neural event , author=. Journal of Cerebral Blood Flow & Metabolism , volume=. 2003 , publisher=
2003
-
[74]
Magnetic resonance imaging , volume=
Dynamics and nonlinearities of the BOLD response at very short stimulus durations , author=. Magnetic resonance imaging , volume=. 2008 , publisher=
2008
-
[75]
Nature , volume=
Array programming with NumPy , author=. Nature , volume=. 2020 , publisher=
2020
-
[76]
Neuron , volume=
Conscious processing and the global neuronal workspace hypothesis , author=. Neuron , volume=. 2020 , publisher=
2020
-
[77]
Neurobiology of Language , volume=
Computational language modeling and the promise of in silico experimentation , author=. Neurobiology of Language , volume=. 2024 , publisher=
2024
-
[78]
Schoppe, Oliver and Harper, Nicol S. and Willmore, Ben D. B. and King, Andrew J. and Schnupp, Jan W. H. , TITLE=. Frontiers in Computational Neuroscience , VOLUME=. 2016 , URL=. doi:10.3389/fncom.2016.00010 , ISSN=
-
[79]
Introduction to transformers for NLP: With the hugging face library and models to solve problems , pages=
Hugging face , author=. Introduction to transformers for NLP: With the hugging face library and models to solve problems , pages=. 2022 , publisher=
2022
-
[80]
Nature Machine Intelligence , volume=
Decoding speech perception from non-invasive brain recordings , author=. Nature Machine Intelligence , volume=. 2023 , publisher=
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.