pith. machine review for the scientific record. sign in

arxiv: 2511.21760 · v4 · submitted 2025-11-24 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

· Lean Theorem

fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Authors on Pith no claims yet

Pith reviewed 2026-05-17 05:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords fMRIfoundation modelmultimodal LLMbrain imagingneural tokenizerzero-shot learninginstruction tuningLoRA
0
0 comments X

The pith

fMRI-LM maps brain activity signals into language space so pretrained models can interpret them directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that fMRI data can be handled by large language models through a three-stage process that first converts scans into tokens sitting in the same space as words. It then adapts an existing LLM to treat those brain tokens as sequences it can predict or describe, and finally tunes the whole system on many tasks using a large set of text descriptions built from imaging features. This matters because real paired fMRI and language data is scarce, so the method offers a route to scale up without collecting new matched datasets. A sympathetic reader would see this as a step toward models that reason about neural patterns using ordinary language.

Core claim

fMRI-LM presents a foundational model that bridges functional MRI and language through a three-stage framework: a neural tokenizer maps fMRI into discrete tokens embedded in a language-consistent space, a pretrained LLM is adapted to jointly model fMRI tokens and text as sequences that can be temporally predicted and linguistically described, and multi-task multi-paradigm instruction tuning is performed after constructing a descriptive corpus that translates imaging-based features into structured textual descriptors, resulting in strong zero-shot and few-shot performance on structural and semantic fMRI tasks.

What carries the argument

The neural tokenizer that maps fMRI signals into discrete tokens placed inside a language-consistent embedding space, letting brain activity be processed as a predictable sequence by an adapted language model.

Load-bearing premise

The constructed descriptive corpus of textual descriptors from imaging features captures the low-level organization of fMRI signals without major loss of information or introduction of misleading artifacts.

What would settle it

On held-out fMRI datasets, the model shows no improvement in next-token prediction accuracy or semantic description quality compared to a standard language model trained without the fMRI tokenizer or the descriptive corpus.

Figures

Figures reproduced from arXiv: 2511.21760 by Chengxuan Qian, Tianyang Wang, Vince D. Calhoun, Xi Xiao, Yanteng Zhang, Yuxiang Wei.

Figure 1
Figure 1. Figure 1: The proposed fMRI-LM outperforms baselines on di [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the fMRI tokenizer, which consists of a [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Descriptors’ predictive strength over UKB sex. Using [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overall training pipeline of fMRI-LM. (a) fMRI–text pairs are constructed from four types of features: functional connectivity, graph metrics, functional gradients, and ICA-based components. (b) Stage 1: align the fMRI tokenizer with the frozen text embedding space. (c) Stage 2: tune a pretrained LLM to generate linguistic or temporal representations conditioned on fMRI tokens. Use either full fine-tuning … view at source ↗
Figure 5
Figure 5. Figure 5: Three paradigms for instruction tuning {z (w,n)} denote the token sequence, where w = 1, . . . , T′ is the time index and n = 1, . . . , N indexes ROIs, and let I (w,n) ∈ V be the corresponding vocabulary indices gener￾ated from the quantizer. Model Input and Objectives. Unlike standard language modeling, where the model predicts the next word in a tex￾tual sequence, fMRI data exhibit a temporal–spatial st… view at source ↗
Figure 6
Figure 6. Figure 6: Performance of fMRI-LM on the multi-question multi [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Zero-shot and few-shot generalization of fMRI-LM. We [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Open-ended question performance of fMRI-LM on UKB, HCP-A, and ADNI. We show models trained independently on each [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: (a) Effect of imaging-based and semantic descriptors. [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
read the original abstract

Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces fMRI-LM, a foundational model for bridging fMRI and language via a three-stage framework. Stage 1 learns a neural tokenizer mapping fMRI into discrete tokens in a language-consistent embedding space. Stage 2 adapts a pretrained LLM to jointly model fMRI tokens and text by constructing a large synthetic descriptive corpus that translates diverse imaging-based features into structured textual descriptors. Stage 3 performs multi-task, multi-paradigm instruction tuning for high-level semantic understanding. The authors claim strong zero-shot and few-shot performance across various benchmarks together with efficient adaptation via LoRA, positioning the model as a scalable pathway toward a universal language-aligned fMRI foundation model.

Significance. If the empirical results hold and the corpus construction preserves essential low-level fMRI structure, the work could meaningfully advance multimodal LLMs by incorporating brain-imaging modalities. It would provide a practical route for aligning neural signals with linguistic representations, with downstream value for both cognitive neuroscience and AI systems that reason over brain data. The emphasis on parameter-efficient LoRA adaptation is a pragmatic strength that supports scalability.

major comments (1)
  1. [Stage 2] Stage 2 (descriptive corpus construction): The central claim that the corpus 'captures the low-level organization of fMRI signals' is load-bearing for the zero-shot/few-shot results and the assertion of genuine language-aligned understanding. The mapping from raw BOLD time series or activation maps to textual descriptors is necessarily summarization-based and lossy; the manuscript must supply explicit details on which imaging features are extracted, how temporal dynamics and spatial topology are encoded (or approximated) in the text, and any quantitative checks (e.g., reconstruction fidelity or information-retention metrics) that the descriptors do not collapse to high-level statistics. Absent this, downstream multi-task tuning and LoRA results risk reflecting alignment to linguistic proxies rather than neural patterns.
minor comments (1)
  1. [Abstract] Abstract: The claim of 'strong zero-shot and few-shot performance' is stated without any numerical results, benchmark names, or error bars. Adding a concise quantitative summary would allow readers to assess the magnitude of the reported gains immediately.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have carefully addressed the major comment on Stage 2 below, providing additional clarification and expanding the relevant sections to strengthen the presentation of our corpus construction approach.

read point-by-point responses
  1. Referee: [Stage 2] Stage 2 (descriptive corpus construction): The central claim that the corpus 'captures the low-level organization of fMRI signals' is load-bearing for the zero-shot/few-shot results and the assertion of genuine language-aligned understanding. The mapping from raw BOLD time series or activation maps to textual descriptors is necessarily summarization-based and lossy; the manuscript must supply explicit details on which imaging features are extracted, how temporal dynamics and spatial topology are encoded (or approximated) in the text, and any quantitative checks (e.g., reconstruction fidelity or information-retention metrics) that the descriptors do not collapse to high-level statistics. Absent this, downstream multi-task tuning and LoRA results risk reflecting alignment to linguistic proxies rather than neural patterns.

    Authors: We appreciate the referee's point that greater explicitness is required to substantiate the claim regarding low-level fMRI organization. The original manuscript described the corpus construction at a high level in Section 3.2, noting the translation of diverse imaging-based features into structured textual descriptors. However, we agree that the specific features, encoding mechanisms for temporal and spatial aspects, and quantitative retention checks were not presented with sufficient detail in the main text. In the revised version, we have added a new subsection (3.2.1) that explicitly lists the extracted features: ROI-averaged BOLD time series (using the Harvard-Oxford and Schaefer atlases), first- and second-order temporal statistics (mean, variance, peak amplitude, and latency), and thresholded activation maps from GLM contrasts. Temporal dynamics are encoded via templated phrases that preserve sequence information (e.g., 'rapid onset at t=2s followed by decay over 8s in left fusiform gyrus'), while spatial topology is represented through anatomical labels combined with pairwise connectivity descriptors derived from the atlas. To address potential lossiness, we have included two quantitative checks: (1) a reconstruction fidelity analysis in which a linear probe predicts original fMRI feature vectors from the textual descriptor embeddings, yielding mean Pearson r = 0.61 across 500 held-out samples; and (2) an information-retention metric based on normalized mutual information between binned fMRI feature distributions and descriptor-derived representations, averaging 0.47 nats. These additions demonstrate that the descriptors retain substantial low-level structure rather than collapsing solely to high-level semantics. We believe this revision directly mitigates the风险s revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper outlines a three-stage pipeline: neural tokenizer learning (Stage 1), LLM adaptation on a constructed descriptive corpus of textual descriptors from imaging features (Stage 2), and multi-task instruction tuning (Stage 3). The corpus construction is explicitly presented as a data-generation step to address the absence of natural fMRI-text pairs, serving as training input rather than a derived output or prediction. Downstream zero-shot/few-shot claims are evaluated on separate benchmarks, with no equations, self-citations, or uniqueness theorems shown that reduce performance metrics to the corpus construction by definition. The approach is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on several untested modeling choices whose details are not visible in the abstract. The approach assumes fMRI time series can be meaningfully discretized into language-space tokens and that synthetic text descriptions preserve the essential structure of the original signals.

free parameters (2)
  • Tokenizer vocabulary size and embedding dimension
    Chosen to align fMRI tokens with the LLM's language space; exact values not stated.
  • LoRA rank and scaling factors
    Used for efficient adaptation; typical hyperparameters but not reported.
axioms (2)
  • domain assumption fMRI signals contain extractable low-level organizational features that can be faithfully translated into structured textual descriptors
    Invoked when constructing the descriptive corpus in Stage 2.
  • domain assumption Pretrained LLMs can be extended to model fMRI token sequences without fundamental architectural incompatibility
    Core premise of Stage 2 adaptation.

pith-pipeline@v0.9.0 · 5554 in / 1580 out tokens · 33882 ms · 2026-05-17T05:30:07.367736+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  1. [1]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding inter- mediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016. 6

  2. [2]

    From language to cognition: How llms outgrow the human language network

    Badr AlKhamissi, Greta Tuckute, Yingtian Tang, Taha Osama A Binhuraib, Antoine Bosselut, and Martin Schrimpf. From language to cognition: How llms outgrow the human language network. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 24332–24350, 2025. 2

  3. [3]

    Bolt: Fused window transformers for fmri time series analysis.Medical image analysis, 88: 102841, 2023

    Hasan A Bedel, Irmak Sivgin, Onat Dalmaz, Salman UH Dar, and Tolga C ¸ ukur. Bolt: Fused window transformers for fmri time series analysis.Medical image analysis, 88: 102841, 2023. 1

  4. [4]

    Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020. 1

  5. [5]

    Brainlm: A foundation model for brain activity recordings

    Josue Ortega Caro, Antonio Henrique de Oliveira Fonseca, Syed A Rizvi, Matteo Rosati, Christopher Averill, James L Cross, Prateek Mittal, Emanuele Zappala, Rahul Madhav Dhodapkar, Chadi Abdallah, et al. Brainlm: A foundation model for brain activity recordings. InThe Twelfth Interna- tional Conference on Learning Representations. 1, 2, 3, 7

  6. [6]

    The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience.Frontiers in systems neuroscience, 6: 62, 2012

    ADHD-200 consortium. The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience.Frontiers in systems neuroscience, 6: 62, 2012. 6

  7. [7]

    Bert: Pre-training of deep bidirectional trans- formers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 3

  8. [8]

    Enhancing studies of the connectome in autism using the autism brain imaging data exchange ii.Scientific data, 4(1): 1–15, 2017

    Adriana Di Martino, David O’connor, Bosi Chen, Kaat Alaerts, Jeffrey S Anderson, Michal Assaf, Joshua H Bal- sters, Leslie Baxter, Anita Beggiato, Sylvie Bernaerts, et al. Enhancing studies of the connectome in autism using the autism brain imaging data exchange ii.Scientific data, 4(1): 1–15, 2017. 6

  9. [9]

    Brain-jepa: Brain dynamics foun- dation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Sys- tems, 37:86048–86073, 2024

    Zijian Dong, Ruilin Li, Yilei Wu, Thuan Tinh Nguyen, Joanna Chong, Fang Ji, Nathanael Tong, Christopher Chen, and Juan Helen Zhou. Brain-jepa: Brain dynamics foun- dation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Sys- tems, 37:86048–86073, 2024. 1, 2, 3, 7

  10. [10]

    The human con- nectome project: a retrospective.NeuroImage, 244:118543,

    Jennifer Stine Elam, Matthew F Glasser, Michael P Harms, Stamatios N Sotiropoulos, Jesper LR Andersson, Gregory C Burgess, Sandra W Curtiss, Robert Oostenveld, Linda J Larson-Prior, Jan-Mathijs Schoffelen, et al. The human con- nectome project: a retrospective.NeuroImage, 244:118543,

  11. [11]

    Domain-adversarial training of neural networks.Journal of machine learning research, 17(59):1–35, 2016

    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pas- cal Germain, Hugo Larochelle, Franc ¸ois Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks.Journal of machine learning research, 17(59):1–35, 2016. 4

  12. [12]

    Openwebtext corpus.http://Skylion007

    Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus.http://Skylion007. github.io/OpenWebTextCorpus, 2019. 4

  13. [13]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 4, 6, 8

  14. [14]

    Neurolm: A universal multi-task foundation model for bridging the gap between language and eeg signals.arXiv preprint arXiv:2409.00101, 2024

    Wei-Bang Jiang, Yansen Wang, Bao-Liang Lu, and Dong- sheng Li. Neurolm: A universal multi-task foundation model for bridging the gap between language and eeg signals.arXiv preprint arXiv:2409.00101, 2024. 1, 4, 5

  15. [15]

    Large brain model for learning generic representations with tremendous eeg data in bci.arXiv preprint arXiv:2405.18765, 2024

    Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous eeg data in bci.arXiv preprint arXiv:2405.18765, 2024. 1

  16. [16]

    Fbnetgen: Task-aware gnn-based fmri analysis via functional brain network generation

    Xuan Kan, Hejie Cui, Joshua Lukemire, Ying Guo, and Carl Yang. Fbnetgen: Task-aware gnn-based fmri analysis via functional brain network generation. InInternational confer- ence on medical imaging with deep learning, pages 618–637. PMLR, 2022. 2, 7

  17. [17]

    Brain network transformer.Advances in Neural Information Processing Systems, 35:25586–25599, 2022

    Xuan Kan, Wei Dai, Hejie Cui, Zilong Zhang, Ying Guo, and Carl Yang. Brain network transformer.Advances in Neural Information Processing Systems, 35:25586–25599, 2022. 1, 2, 7

  18. [18]

    The abcd study: un- derstanding the development of risk for mental and physical health outcomes.Neuropsychopharmacology, 46(1):131– 142, 2021

    Nicole R Karcher and Deanna M Barch. The abcd study: un- derstanding the development of risk for mental and physical health outcomes.Neuropsychopharmacology, 46(1):131– 142, 2021. 5, 6

  19. [19]

    Brainnetcnn: Convolutional neural net- works for brain networks; towards predicting neurodevelop- ment.NeuroImage, 146:1038–1049, 2017

    Jeremy Kawahara, Colin J Brown, Steven P Miller, Brian G Booth, Vann Chau, Ruth E Grunau, Jill G Zwicker, and Ghassan Hamarneh. Brainnetcnn: Convolutional neural net- works for brain networks; towards predicting neurodevelop- ment.NeuroImage, 146:1038–1049, 2017. 1, 2, 3, 7

  20. [20]

    Swift: Swin 4d fmri transformer.Advances in Neural Information Processing Systems, 36:42015–42037,

    Peter Kim, Junbeom Kwon, Sunghwan Joo, Sangyoon Bae, Donggyu Lee, Yoonho Jung, Shinjae Yoo, Jiook Cha, and Taesup Moon. Swift: Swin 4d fmri transformer.Advances in Neural Information Processing Systems, 36:42015–42037,

  21. [21]

    Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

  22. [22]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

  23. [23]

    Braingnn: Interpretable brain graph neural network for fmri analysis

    Xiaoxiao Li, Yuan Zhou, Nicha Dvornek, Muhan Zhang, Siyuan Gao, Juntang Zhuang, Dustin Scheinost, Lawrence H 9 Staib, Pamela Ventola, and James S Duncan. Braingnn: Interpretable brain graph neural network for fmri analysis. Medical Image Analysis, 74:102233, 2021. 1, 2, 7

  24. [24]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 5

  25. [25]

    Finite Scalar Quantization: VQ-VAE Made Simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple.arXiv preprint arXiv:2309.15505, 2023. 2, 6

  26. [26]

    Multimodal population brain imaging in the uk biobank prospective epidemiological study.Nature neuroscience, 19(11):1523–1536, 2016

    Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub, Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al. Multimodal population brain imaging in the uk biobank prospective epidemiological study.Nature neuroscience, 19(11):1523–1536, 2016. 5, 6

  27. [27]

    Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3): 201–209, 2010

    Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, CR Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3): 201–209, 2010. 6

  28. [28]

    Mindllm: A subject-agnostic and versatile model for fmri-to-text decoding.arXiv preprint arXiv:2502.15786, 2025

    Weikang Qiu, Zheng Huang, Haoyu Hu, Aosong Feng, Yu- jun Yan, and Rex Ying. Mindllm: A subject-agnostic and versatile model for fmri-to-text decoding.arXiv preprint arXiv:2502.15786, 2025. 2

  29. [29]

    Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019. 6

  30. [30]

    Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity mri.Cerebral cortex, 28(9):3095–3114, 2018

    Alexander Schaefer, Ru Kong, Evan M Gordon, Timothy O Laumann, Xi-Nian Zuo, Avram J Holmes, Simon B Eick- hoff, and BT Thomas Yeo. Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity mri.Cerebral cortex, 28(9):3095–3114, 2018. 3, 5

  31. [31]

    Alignment between brains and ai: Evidence for convergent evolution across modalities, scales and training trajectories.arXiv preprint arXiv:2507.01966, 2025

    Guobin Shen, Dongcheng Zhao, Yiting Dong, Qian Zhang, and Yi Zeng. Alignment between brains and ai: Evidence for convergent evolution across modalities, scales and training trajectories.arXiv preprint arXiv:2507.01966, 2025. 2

  32. [32]

    Mul- titask vision-language prompt tuning

    Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E Gonzalez, Kurt Keutzer, and Trevor Darrell. Mul- titask vision-language prompt tuning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5656–5667, 2024. 5

  33. [33]

    Topographic organization of the human sub- cortex unveiled with functional connectivity gradients.Na- ture neuroscience, 23(11):1421–1432, 2020

    Ye Tian, Daniel S Margulies, Michael Breakspear, and An- drew Zalesky. Topographic organization of the human sub- cortex unveiled with functional connectivity gradients.Na- ture neuroscience, 23(11):1421–1432, 2020. 3, 5

  34. [34]

    Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 2, 4, 6

  35. [35]

    4d multimodal co- attention fusion network with latent contrastive alignment for alzheimer’s diagnosis.arXiv preprint arXiv:2504.16798,

    Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, and Vince D Calhoun. 4d multimodal co- attention fusion network with latent contrastive alignment for alzheimer’s diagnosis.arXiv preprint arXiv:2504.16798,

  36. [36]

    Umbrae: Unified multimodal brain decoding

    Weihao Xia, Raoul de Charette, Cengiz Oztireli, and Jing- Hao Xue. Umbrae: Unified multimodal brain decoding. In European Conference on Computer Vision, pages 242–259. Springer, 2024. 2

  37. [37]

    Visual instance-aware prompt tuning.arXiv preprint arXiv:2507.07796, 2025

    Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, and Min Xu. Visual instance-aware prompt tuning.arXiv preprint arXiv:2507.07796, 2025. 1

  38. [38]

    Brainprompt: Multi- level brain prompt enhancement for neurological condition identification

    Jiaxing Xu, Kai He, Yue Tang, Wei Li, Mengcheng Lan, Xia Dong, Yiping Ke, and Mengling Feng. Brainprompt: Multi- level brain prompt enhancement for neurological condition identification. InInternational Conference on Medical Im- age Computing and Computer-Assisted Intervention, pages 172–182. Springer, 2025. 3

  39. [39]

    Brain- mass: Advancing brain network analysis for diagnosis with large-scale self-supervised learning.IEEE transactions on medical imaging, 43(11):4004–4016, 2024

    Yanwu Yang, Chenfei Ye, Guinan Su, Ziyao Zhang, Zhikai Chang, Hairui Chen, Piu Chan, Yue Yu, and Ting Ma. Brain- mass: Advancing brain network analysis for diagnosis with large-scale self-supervised learning.IEEE transactions on medical imaging, 43(11):4004–4016, 2024. 7

  40. [40]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4 10