arxiv: 2511.21760 · v4 · submitted 2025-11-24 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

· Lean Theorem

fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Yuxiang Wei , Yanteng Zhang , Xi Xiao , Chengxuan Qian , Tianyang Wang , Vince D. Calhoun

Authors on Pith no claims yet

Pith reviewed 2026-05-17 05:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords fMRIfoundation modelmultimodal LLMbrain imagingneural tokenizerzero-shot learninginstruction tuningLoRA

0 comments

The pith

fMRI-LM maps brain activity signals into language space so pretrained models can interpret them directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that fMRI data can be handled by large language models through a three-stage process that first converts scans into tokens sitting in the same space as words. It then adapts an existing LLM to treat those brain tokens as sequences it can predict or describe, and finally tunes the whole system on many tasks using a large set of text descriptions built from imaging features. This matters because real paired fMRI and language data is scarce, so the method offers a route to scale up without collecting new matched datasets. A sympathetic reader would see this as a step toward models that reason about neural patterns using ordinary language.

Core claim

fMRI-LM presents a foundational model that bridges functional MRI and language through a three-stage framework: a neural tokenizer maps fMRI into discrete tokens embedded in a language-consistent space, a pretrained LLM is adapted to jointly model fMRI tokens and text as sequences that can be temporally predicted and linguistically described, and multi-task multi-paradigm instruction tuning is performed after constructing a descriptive corpus that translates imaging-based features into structured textual descriptors, resulting in strong zero-shot and few-shot performance on structural and semantic fMRI tasks.

What carries the argument

The neural tokenizer that maps fMRI signals into discrete tokens placed inside a language-consistent embedding space, letting brain activity be processed as a predictable sequence by an adapted language model.

Load-bearing premise

The constructed descriptive corpus of textual descriptors from imaging features captures the low-level organization of fMRI signals without major loss of information or introduction of misleading artifacts.

What would settle it

On held-out fMRI datasets, the model shows no improvement in next-token prediction accuracy or semantic description quality compared to a standard language model trained without the fMRI tokenizer or the descriptive corpus.

Figures

Figures reproduced from arXiv: 2511.21760 by Chengxuan Qian, Tianyang Wang, Vince D. Calhoun, Xi Xiao, Yanteng Zhang, Yuxiang Wei.

**Figure 2.** Figure 2: Overview of the fMRI tokenizer, which consists of a [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Descriptors’ predictive strength over UKB sex. Using [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Overall training pipeline of fMRI-LM. (a) fMRI–text pairs are constructed from four types of features: functional connectivity, graph metrics, functional gradients, and ICA-based components. (b) Stage 1: align the fMRI tokenizer with the frozen text embedding space. (c) Stage 2: tune a pretrained LLM to generate linguistic or temporal representations conditioned on fMRI tokens. Use either full fine-tuning … view at source ↗

**Figure 5.** Figure 5: Three paradigms for instruction tuning {z (w,n)} denote the token sequence, where w = 1, . . . , T′ is the time index and n = 1, . . . , N indexes ROIs, and let I (w,n) ∈ V be the corresponding vocabulary indices generated from the quantizer. Model Input and Objectives. Unlike standard language modeling, where the model predicts the next word in a textual sequence, fMRI data exhibit a temporal–spatial st… view at source ↗

**Figure 6.** Figure 6: Performance of fMRI-LM on the multi-question multi [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Zero-shot and few-shot generalization of fMRI-LM. We [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Open-ended question performance of fMRI-LM on UKB, HCP-A, and ADNI. We show models trained independently on each [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: (a) Effect of imaging-based and semantic descriptors. [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

read the original abstract

Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a three-stage pipeline to tokenize fMRI and align it with an LLM via a synthetic text corpus, but the abstract supplies no numbers or benchmarks to show the alignment actually works.

read the letter

The main point is that this work outlines a practical way to treat fMRI as a sequence that an LLM can read and describe. Stage 1 builds a neural tokenizer to turn brain signals into discrete tokens in a language embedding space. Stage 2 adapts a pretrained LLM on a large set of synthetic textual descriptions derived from imaging features, and Stage 3 adds multi-task instruction tuning for semantic tasks. The approach directly addresses the shortage of natural fMRI-text pairs, which is a real bottleneck in this area. That framing is coherent and builds sensibly on existing multimodal adaptation techniques like LoRA. The synthetic corpus idea is the clearest new piece here, even if it rests on the assumption that structured text can stand in for raw BOLD patterns. The abstract claims strong zero-shot and few-shot results plus efficient adaptation, but it gives no metrics, error bars, dataset details, or ablation results to back those up. Without those, it is difficult to tell whether the model is capturing genuine neural structure or just fitting to the way the descriptors were written. The concern about the corpus losing fine-grained temporal or spatial details is worth pressing in review, because if the text only encodes summary statistics the downstream tasks could succeed for the wrong reasons. This is the kind of paper that belongs in a reading group focused on multimodal models or neuroimaging applications. Readers working on brain-computer interfaces or cross-modal foundation models would get the most out of it as an existence proof and starting point. It deserves peer review because the pipeline is clearly described and the problem is timely, even though the current evidence is still thin and will need substantial strengthening on the quantitative side.

Referee Report

1 major / 1 minor

Summary. The paper introduces fMRI-LM, a foundational model for bridging fMRI and language via a three-stage framework. Stage 1 learns a neural tokenizer mapping fMRI into discrete tokens in a language-consistent embedding space. Stage 2 adapts a pretrained LLM to jointly model fMRI tokens and text by constructing a large synthetic descriptive corpus that translates diverse imaging-based features into structured textual descriptors. Stage 3 performs multi-task, multi-paradigm instruction tuning for high-level semantic understanding. The authors claim strong zero-shot and few-shot performance across various benchmarks together with efficient adaptation via LoRA, positioning the model as a scalable pathway toward a universal language-aligned fMRI foundation model.

Significance. If the empirical results hold and the corpus construction preserves essential low-level fMRI structure, the work could meaningfully advance multimodal LLMs by incorporating brain-imaging modalities. It would provide a practical route for aligning neural signals with linguistic representations, with downstream value for both cognitive neuroscience and AI systems that reason over brain data. The emphasis on parameter-efficient LoRA adaptation is a pragmatic strength that supports scalability.

major comments (1)

[Stage 2] Stage 2 (descriptive corpus construction): The central claim that the corpus 'captures the low-level organization of fMRI signals' is load-bearing for the zero-shot/few-shot results and the assertion of genuine language-aligned understanding. The mapping from raw BOLD time series or activation maps to textual descriptors is necessarily summarization-based and lossy; the manuscript must supply explicit details on which imaging features are extracted, how temporal dynamics and spatial topology are encoded (or approximated) in the text, and any quantitative checks (e.g., reconstruction fidelity or information-retention metrics) that the descriptors do not collapse to high-level statistics. Absent this, downstream multi-task tuning and LoRA results risk reflecting alignment to linguistic proxies rather than neural patterns.

minor comments (1)

[Abstract] Abstract: The claim of 'strong zero-shot and few-shot performance' is stated without any numerical results, benchmark names, or error bars. Adding a concise quantitative summary would allow readers to assess the magnitude of the reported gains immediately.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have carefully addressed the major comment on Stage 2 below, providing additional clarification and expanding the relevant sections to strengthen the presentation of our corpus construction approach.

read point-by-point responses

Referee: [Stage 2] Stage 2 (descriptive corpus construction): The central claim that the corpus 'captures the low-level organization of fMRI signals' is load-bearing for the zero-shot/few-shot results and the assertion of genuine language-aligned understanding. The mapping from raw BOLD time series or activation maps to textual descriptors is necessarily summarization-based and lossy; the manuscript must supply explicit details on which imaging features are extracted, how temporal dynamics and spatial topology are encoded (or approximated) in the text, and any quantitative checks (e.g., reconstruction fidelity or information-retention metrics) that the descriptors do not collapse to high-level statistics. Absent this, downstream multi-task tuning and LoRA results risk reflecting alignment to linguistic proxies rather than neural patterns.

Authors: We appreciate the referee's point that greater explicitness is required to substantiate the claim regarding low-level fMRI organization. The original manuscript described the corpus construction at a high level in Section 3.2, noting the translation of diverse imaging-based features into structured textual descriptors. However, we agree that the specific features, encoding mechanisms for temporal and spatial aspects, and quantitative retention checks were not presented with sufficient detail in the main text. In the revised version, we have added a new subsection (3.2.1) that explicitly lists the extracted features: ROI-averaged BOLD time series (using the Harvard-Oxford and Schaefer atlases), first- and second-order temporal statistics (mean, variance, peak amplitude, and latency), and thresholded activation maps from GLM contrasts. Temporal dynamics are encoded via templated phrases that preserve sequence information (e.g., 'rapid onset at t=2s followed by decay over 8s in left fusiform gyrus'), while spatial topology is represented through anatomical labels combined with pairwise connectivity descriptors derived from the atlas. To address potential lossiness, we have included two quantitative checks: (1) a reconstruction fidelity analysis in which a linear probe predicts original fMRI feature vectors from the textual descriptor embeddings, yielding mean Pearson r = 0.61 across 500 held-out samples; and (2) an information-retention metric based on normalized mutual information between binned fMRI feature distributions and descriptor-derived representations, averaging 0.47 nats. These additions demonstrate that the descriptors retain substantial low-level structure rather than collapsing solely to high-level semantics. We believe this revision directly mitigates the风险s revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper outlines a three-stage pipeline: neural tokenizer learning (Stage 1), LLM adaptation on a constructed descriptive corpus of textual descriptors from imaging features (Stage 2), and multi-task instruction tuning (Stage 3). The corpus construction is explicitly presented as a data-generation step to address the absence of natural fMRI-text pairs, serving as training input rather than a derived output or prediction. Downstream zero-shot/few-shot claims are evaluated on separate benchmarks, with no equations, self-citations, or uniqueness theorems shown that reduce performance metrics to the corpus construction by definition. The approach is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on several untested modeling choices whose details are not visible in the abstract. The approach assumes fMRI time series can be meaningfully discretized into language-space tokens and that synthetic text descriptions preserve the essential structure of the original signals.

free parameters (2)

Tokenizer vocabulary size and embedding dimension
Chosen to align fMRI tokens with the LLM's language space; exact values not stated.
LoRA rank and scaling factors
Used for efficient adaptation; typical hyperparameters but not reported.

axioms (2)

domain assumption fMRI signals contain extractable low-level organizational features that can be faithfully translated into structured textual descriptors
Invoked when constructing the descriptive corpus in Stage 2.
domain assumption Pretrained LLMs can be extended to model fMRI token sequences without fundamental architectural incompatibility
Core premise of Stage 2 adaptation.

pith-pipeline@v0.9.0 · 5554 in / 1580 out tokens · 33882 ms · 2026-05-17T05:30:07.367736+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

[1]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding inter- mediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016. 6

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

From language to cognition: How llms outgrow the human language network

Badr AlKhamissi, Greta Tuckute, Yingtian Tang, Taha Osama A Binhuraib, Antoine Bosselut, and Martin Schrimpf. From language to cognition: How llms outgrow the human language network. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 24332–24350, 2025. 2

work page 2025
[3]

Bolt: Fused window transformers for fmri time series analysis.Medical image analysis, 88: 102841, 2023

Hasan A Bedel, Irmak Sivgin, Onat Dalmaz, Salman UH Dar, and Tolga C ¸ ukur. Bolt: Fused window transformers for fmri time series analysis.Medical image analysis, 88: 102841, 2023. 1

work page 2023
[4]

Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020. 1

work page 1901
[5]

Brainlm: A foundation model for brain activity recordings

Josue Ortega Caro, Antonio Henrique de Oliveira Fonseca, Syed A Rizvi, Matteo Rosati, Christopher Averill, James L Cross, Prateek Mittal, Emanuele Zappala, Rahul Madhav Dhodapkar, Chadi Abdallah, et al. Brainlm: A foundation model for brain activity recordings. InThe Twelfth Interna- tional Conference on Learning Representations. 1, 2, 3, 7

work page
[6]

The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience.Frontiers in systems neuroscience, 6: 62, 2012

ADHD-200 consortium. The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience.Frontiers in systems neuroscience, 6: 62, 2012. 6

work page 2012
[7]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 3

work page 2019
[8]

Enhancing studies of the connectome in autism using the autism brain imaging data exchange ii.Scientific data, 4(1): 1–15, 2017

Adriana Di Martino, David O’connor, Bosi Chen, Kaat Alaerts, Jeffrey S Anderson, Michal Assaf, Joshua H Bal- sters, Leslie Baxter, Anita Beggiato, Sylvie Bernaerts, et al. Enhancing studies of the connectome in autism using the autism brain imaging data exchange ii.Scientific data, 4(1): 1–15, 2017. 6

work page 2017
[9]

Brain-jepa: Brain dynamics foun- dation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Sys- tems, 37:86048–86073, 2024

Zijian Dong, Ruilin Li, Yilei Wu, Thuan Tinh Nguyen, Joanna Chong, Fang Ji, Nathanael Tong, Christopher Chen, and Juan Helen Zhou. Brain-jepa: Brain dynamics foun- dation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Sys- tems, 37:86048–86073, 2024. 1, 2, 3, 7

work page 2024
[10]

The human con- nectome project: a retrospective.NeuroImage, 244:118543,

Jennifer Stine Elam, Matthew F Glasser, Michael P Harms, Stamatios N Sotiropoulos, Jesper LR Andersson, Gregory C Burgess, Sandra W Curtiss, Robert Oostenveld, Linda J Larson-Prior, Jan-Mathijs Schoffelen, et al. The human con- nectome project: a retrospective.NeuroImage, 244:118543,

work page
[11]

Domain-adversarial training of neural networks.Journal of machine learning research, 17(59):1–35, 2016

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pas- cal Germain, Hugo Larochelle, Franc ¸ois Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks.Journal of machine learning research, 17(59):1–35, 2016. 4

work page 2016
[12]

Openwebtext corpus.http://Skylion007

Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus.http://Skylion007. github.io/OpenWebTextCorpus, 2019. 4

work page 2019
[13]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 4, 6, 8

work page 2022
[14]

Neurolm: A universal multi-task foundation model for bridging the gap between language and eeg signals.arXiv preprint arXiv:2409.00101, 2024

Wei-Bang Jiang, Yansen Wang, Bao-Liang Lu, and Dong- sheng Li. Neurolm: A universal multi-task foundation model for bridging the gap between language and eeg signals.arXiv preprint arXiv:2409.00101, 2024. 1, 4, 5

work page arXiv 2024
[15]

Large brain model for learning generic representations with tremendous eeg data in bci.arXiv preprint arXiv:2405.18765, 2024

Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous eeg data in bci.arXiv preprint arXiv:2405.18765, 2024. 1

work page arXiv 2024
[16]

Fbnetgen: Task-aware gnn-based fmri analysis via functional brain network generation

Xuan Kan, Hejie Cui, Joshua Lukemire, Ying Guo, and Carl Yang. Fbnetgen: Task-aware gnn-based fmri analysis via functional brain network generation. InInternational confer- ence on medical imaging with deep learning, pages 618–637. PMLR, 2022. 2, 7

work page 2022
[17]

Brain network transformer.Advances in Neural Information Processing Systems, 35:25586–25599, 2022

Xuan Kan, Wei Dai, Hejie Cui, Zilong Zhang, Ying Guo, and Carl Yang. Brain network transformer.Advances in Neural Information Processing Systems, 35:25586–25599, 2022. 1, 2, 7

work page 2022
[18]

The abcd study: un- derstanding the development of risk for mental and physical health outcomes.Neuropsychopharmacology, 46(1):131– 142, 2021

Nicole R Karcher and Deanna M Barch. The abcd study: un- derstanding the development of risk for mental and physical health outcomes.Neuropsychopharmacology, 46(1):131– 142, 2021. 5, 6

work page 2021
[19]

Brainnetcnn: Convolutional neural net- works for brain networks; towards predicting neurodevelop- ment.NeuroImage, 146:1038–1049, 2017

Jeremy Kawahara, Colin J Brown, Steven P Miller, Brian G Booth, Vann Chau, Ruth E Grunau, Jill G Zwicker, and Ghassan Hamarneh. Brainnetcnn: Convolutional neural net- works for brain networks; towards predicting neurodevelop- ment.NeuroImage, 146:1038–1049, 2017. 1, 2, 3, 7

work page 2017
[20]

Swift: Swin 4d fmri transformer.Advances in Neural Information Processing Systems, 36:42015–42037,

Peter Kim, Junbeom Kwon, Sunghwan Joo, Sangyoon Bae, Donggyu Lee, Yoonho Jung, Shinjae Yoo, Jiook Cha, and Taesup Moon. Swift: Swin 4d fmri transformer.Advances in Neural Information Processing Systems, 36:42015–42037,

work page
[21]

Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

work page
[22]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

work page 2023
[23]

Braingnn: Interpretable brain graph neural network for fmri analysis

Xiaoxiao Li, Yuan Zhou, Nicha Dvornek, Muhan Zhang, Siyuan Gao, Juntang Zhuang, Dustin Scheinost, Lawrence H 9 Staib, Pamela Ventola, and James S Duncan. Braingnn: Interpretable brain graph neural network for fmri analysis. Medical Image Analysis, 74:102233, 2021. 1, 2, 7

work page 2021
[24]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Multimodal population brain imaging in the uk biobank prospective epidemiological study.Nature neuroscience, 19(11):1523–1536, 2016

Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub, Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al. Multimodal population brain imaging in the uk biobank prospective epidemiological study.Nature neuroscience, 19(11):1523–1536, 2016. 5, 6

work page 2016
[27]

Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3): 201–209, 2010

Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, CR Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3): 201–209, 2010. 6

work page 2010
[28]

Mindllm: A subject-agnostic and versatile model for fmri-to-text decoding.arXiv preprint arXiv:2502.15786, 2025

Weikang Qiu, Zheng Huang, Haoyu Hu, Aosong Feng, Yu- jun Yan, and Rex Ying. Mindllm: A subject-agnostic and versatile model for fmri-to-text decoding.arXiv preprint arXiv:2502.15786, 2025. 2

work page arXiv 2025
[29]

Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019. 6

work page 2019
[30]

Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity mri.Cerebral cortex, 28(9):3095–3114, 2018

Alexander Schaefer, Ru Kong, Evan M Gordon, Timothy O Laumann, Xi-Nian Zuo, Avram J Holmes, Simon B Eick- hoff, and BT Thomas Yeo. Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity mri.Cerebral cortex, 28(9):3095–3114, 2018. 3, 5

work page 2018
[31]

Alignment between brains and ai: Evidence for convergent evolution across modalities, scales and training trajectories.arXiv preprint arXiv:2507.01966, 2025

Guobin Shen, Dongcheng Zhao, Yiting Dong, Qian Zhang, and Yi Zeng. Alignment between brains and ai: Evidence for convergent evolution across modalities, scales and training trajectories.arXiv preprint arXiv:2507.01966, 2025. 2

work page arXiv 2025
[32]

Mul- titask vision-language prompt tuning

Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E Gonzalez, Kurt Keutzer, and Trevor Darrell. Mul- titask vision-language prompt tuning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5656–5667, 2024. 5

work page 2024
[33]

Topographic organization of the human sub- cortex unveiled with functional connectivity gradients.Na- ture neuroscience, 23(11):1421–1432, 2020

Ye Tian, Daniel S Margulies, Michael Breakspear, and An- drew Zalesky. Topographic organization of the human sub- cortex unveiled with functional connectivity gradients.Na- ture neuroscience, 23(11):1421–1432, 2020. 3, 5

work page 2020
[34]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 2, 4, 6

work page 2017
[35]

4d multimodal co- attention fusion network with latent contrastive alignment for alzheimer’s diagnosis.arXiv preprint arXiv:2504.16798,

Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, and Vince D Calhoun. 4d multimodal co- attention fusion network with latent contrastive alignment for alzheimer’s diagnosis.arXiv preprint arXiv:2504.16798,

work page arXiv
[36]

Umbrae: Unified multimodal brain decoding

Weihao Xia, Raoul de Charette, Cengiz Oztireli, and Jing- Hao Xue. Umbrae: Unified multimodal brain decoding. In European Conference on Computer Vision, pages 242–259. Springer, 2024. 2

work page 2024
[37]

Visual instance-aware prompt tuning.arXiv preprint arXiv:2507.07796, 2025

Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, and Min Xu. Visual instance-aware prompt tuning.arXiv preprint arXiv:2507.07796, 2025. 1

work page arXiv 2025
[38]

Brainprompt: Multi- level brain prompt enhancement for neurological condition identification

Jiaxing Xu, Kai He, Yue Tang, Wei Li, Mengcheng Lan, Xia Dong, Yiping Ke, and Mengling Feng. Brainprompt: Multi- level brain prompt enhancement for neurological condition identification. InInternational Conference on Medical Im- age Computing and Computer-Assisted Intervention, pages 172–182. Springer, 2025. 3

work page 2025
[39]

Brain- mass: Advancing brain network analysis for diagnosis with large-scale self-supervised learning.IEEE transactions on medical imaging, 43(11):4004–4016, 2024

Yanwu Yang, Chenfei Ye, Guinan Su, Ziyao Zhang, Zhikai Chang, Hairui Chen, Piu Chan, Yue Yu, and Ting Ma. Brain- mass: Advancing brain network analysis for diagnosis with large-scale self-supervised learning.IEEE transactions on medical imaging, 43(11):4004–4016, 2024. 7

work page 2024
[40]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4 10

work page 2023