pith. machine review for the scientific record.

arxiv: 2605.07302 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation

Junjie Yu, Yue Wang, Zihan Deng, Yan Zhu, Wenxiao Ma, Quanying Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords spectral analysis · pretraining · finetuning · singular vectors · transfer learning · parameter-efficient fine-tuning · GLUE · weight matrix stability

The pith

Pretraining creates a reusable spectral basis of stable leading singular vectors that finetuning inherits across tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why finetuning leaves certain directions in the parameter space untouched. Systematic analysis of weight matrices in vision and language models shows that the leading singular vectors stay nearly unchanged after finetuning and turn up in many unrelated downstream tasks. This pattern indicates that pretraining builds a fixed coordinate system whose directions already contain useful structure. Larger pretraining datasets strengthen the stability of these directions even when data distributions or tasks shift. The authors turn the observation into a practical method that freezes the pretrained vectors and tunes only their coefficients, reaching competitive GLUE scores with 0.2 percent of the usual trainable parameters.
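As a concrete illustration, here is a minimal sketch, not the authors' code, of the stability measurement this summary describes: take a layer's weight matrix before and after finetuning, compute both SVDs, and compare matched leading singular vectors by cosine similarity. The function name, rank cutoff, and toy matrices are assumptions for illustration only.

```python
import numpy as np

def singular_vector_alignment(W_pre: np.ndarray, W_ft: np.ndarray, top_k: int = 16):
    """Cosine similarity between matched leading left/right singular vectors."""
    U0, _, Vt0 = np.linalg.svd(W_pre, full_matrices=False)
    U1, _, Vt1 = np.linalg.svd(W_ft, full_matrices=False)
    # Singular vectors are defined only up to sign, so compare absolute cosines.
    left = np.abs(np.sum(U0[:, :top_k] * U1[:, :top_k], axis=0))
    right = np.abs(np.sum(Vt0[:top_k] * Vt1[:top_k], axis=1))
    return left, right

# Toy check: plant a well-separated spectrum, apply a small additive update,
# and confirm the leading directions stay near cosine similarity 1.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(256, 8)))
V, _ = np.linalg.qr(rng.normal(size=(512, 8)))
W_pre = (U * np.linspace(20.0, 6.0, 8)) @ V.T
W_ft = W_pre + 0.01 * rng.normal(size=W_pre.shape)   # stand-in for a finetuning update
left, right = singular_vector_alignment(W_pre, W_ft, top_k=4)
print(left.round(3), right.round(3))
```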

Core claim

Pretraining induces a reusable spectral basis formed by the leading singular vectors of the weight matrices. These vectors remain highly stable under finetuning and are shared across unrelated downstream tasks in vision and language settings. Models pretrained on larger datasets display greater spectral stability under distribution shift or task change. The stable directions encode transferable task-relevant structure, shown by the fact that optimizing only the associated spectral coefficients while freezing the basis vectors yields competitive performance on GLUE using 0.2 percent trainable parameters.

What carries the argument

The leading singular vectors of pretrained weight matrices, which act as a fixed spectral basis that downstream tasks reuse by adjusting only the corresponding coefficients.
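A hedged PyTorch sketch of that adaptation route (the class name, rank cutoff, and frozen-residual detail are assumptions, not the authors' released implementation): take the SVD of a pretrained linear layer once, freeze the singular vectors, and expose only the leading spectral coefficients to the optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralCoefficientLinear(nn.Module):
    """Wrap a pretrained nn.Linear: freeze its singular vectors and train only
    the leading spectral coefficients; the trailing spectrum stays fixed."""

    def __init__(self, pretrained: nn.Linear, top_k: int = 128):
        super().__init__()
        W = pretrained.weight.data                       # (out_features, in_features)
        U, s, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U[:, :top_k].clone())  # frozen left singular vectors
        self.register_buffer("Vh", Vh[:top_k].clone())   # frozen right singular vectors
        # Trailing spectral content is kept as a frozen residual so that only
        # the leading coefficients can move during finetuning.
        self.register_buffer("residual", W - self.U @ torch.diag(s[:top_k]) @ self.Vh)
        self.coeffs = nn.Parameter(s[:top_k].clone())    # the only trainable tensor
        bias = None if pretrained.bias is None else pretrained.bias.data.clone()
        self.register_buffer("bias", bias)

    def forward(self, x):
        W_eff = self.U @ torch.diag(self.coeffs) @ self.Vh + self.residual
        return F.linear(x, W_eff, self.bias)
```

Swapping such a wrapper into a model's projection layers and training only `coeffs` alongside the task head is one way to land near the 0.2 percent trainable-parameter regime described above; the exact layers, ranks, and parameterization used in the paper are not specified here.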

If this is right

  • Finetuning reduces to low-dimensional optimization along the stable spectral directions.
  • Pretraining scale directly increases the geometric transferability of the learned basis.
  • Freezing the singular vectors and tuning only coefficients offers a parameter-efficient adaptation route that works across vision and language models.
  • The same spectral directions serve multiple unrelated tasks without further adjustment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other model families if similar spectral stability appears in their weight matrices.
  • Pretraining losses might be modified to encourage even stronger reuse of leading singular directions.
  • Task similarity could be measured by overlap in the leading singular vectors each task activates.
  • The finding suggests that some parameters are deliberately left fixed during pretraining to support broad transfer.

Load-bearing premise

The stability of the leading singular vectors under finetuning shows they carry transferable task-relevant structure rather than emerging as an artifact of optimization dynamics or matrix initialization.

What would settle it

Finding that the leading singular vectors change substantially during finetuning on new tasks, or that unrelated tasks rely on entirely different leading vectors, would disprove the reusable-basis claim.
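One concrete way to run that test, sketched under the assumption that per-task finetuned checkpoints of the same layer are available (names are illustrative): compare the leading left-singular subspaces of a weight matrix finetuned on two unrelated tasks. Overlap near 1 is consistent with a shared basis; overlap near the random-subspace baseline of roughly k over the ambient dimension would count against it.

```python
import numpy as np

def leading_subspace_overlap(W_task_a: np.ndarray, W_task_b: np.ndarray, top_k: int = 32) -> float:
    """Mean squared cosine of principal angles between the two rank-k left subspaces."""
    Ua, _, _ = np.linalg.svd(W_task_a, full_matrices=False)
    Ub, _, _ = np.linalg.svd(W_task_b, full_matrices=False)
    # Singular values of the cross-Gram matrix are the cosines of the principal angles.
    cosines = np.linalg.svd(Ua[:, :top_k].T @ Ub[:, :top_k], compute_uv=False)
    return float(np.mean(cosines ** 2))
```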

Figures

Figures reproduced from arXiv: 2605.07302 by Junjie Yu, Quanying Liu, Wenxiao Ma, Yan Zhu, Yue Wang, Zihan Deng.

Figure 1
Figure 1. Figure 1: Pretraining induces a reusable spectral basis for downstream task adaptation. (A) Finetuning updates are not uniformly distributed across the parameter space: some directions undergo significant changes while others remain largely unchanged, suggesting that optimization is effectively confined to a low-dimensional subspace. (B) To understand which directions are preserved and which are altered during finet… view at source ↗
Figure 2
Figure 2. Figure 2: Finetuning largely preserves the pretrained spectral basis. (A, B) Cosine similarity between matched left/right singular vectors before and after finetuning across language and vision models. (C) Similarity as a function of rank and layer: leading directions and shallow layers are the most stable. (D) Replacing finetuned singular vectors with pretrained ones causes little performance loss. (E) Swapping sin… view at source ↗
Figure 3
Figure 3. Figure 3: Spectral stability is controlled by update magnitude and directional alignment. (A) For the leading singular direction, smaller singular-value ratios generally correlate with higher singular-vector similarity. (B) Representative case where σ1(∆W) ≪ σ1(Wpt): the leading singular vector is preserved with high fidelity. (C) Representative case where σ1(∆W) ≈ σ1(Wpt): stability can be maintained when th… view at source ↗
Figure 4
Figure 4. Figure 4: Pretraining progressively stabilizes the spectral basis, and larger datasets make it more transferable. (A) Cosine similarity between top singular vectors from different pretraining checkpoints. Similarity increases in later stages, showing that leading spectral directions gradually converge during pretraining. (B) Post-training on different tasks with the same data source. (C) Post-training on different d… view at source ↗
Figure 5
Figure 5. Figure 5: Impact of Components on Finetuning. (A) Increasing the range of rank consistently improves downstream performance, indicating that larger spectral subspaces capture more transferable information. (B) Optimization in deeper layers leads to better performance, highlighting the importance of deep-layer adaptation. (C) Performance differences across leading, middle, and trailing rank ranges are modest. We fir… view at source ↗
Figure 6
Figure 6. Figure 6: Relative change in singular values across different ranks for the models in our analysis. For each model and rank, we compute the relative change as the difference between finetuned and pretrained singular values divided by the pretrained value, visualized along the quantile axis. In most cases, the relative changes remain small across all ranks, indicating that finetuning preserves both the leading … view at source ↗
Figure 7
Figure 7. Figure 7: Spectral transfer in EEG and fMRI models. We repeat the singular-vector alignment analysis for pretrained EEG and fMRI models. Panels (A–D) show EEG results, and panels (E–H) show fMRI results. In both modalities, pretrained and finetuned singular vectors remain highly aligned, especially in the leading spectral components, indicating that finetuning preserves the dominant pretrained spectral coordinates while allowi… view at source ↗
Figure 8
Figure 8. Figure 8: Singular-vector alignment over broader rank ranges. We compare rank-dependent alignment across vision, language, EEG, and fMRI models. view at source ↗
Figure 9
Figure 9. Figure 9: Random labels break spectral stability. Starting from the same pretrained checkpoint and using the same input data, singular vectors remain stable under true labels but rapidly lose alignment under random labels. view at source ↗
Figure 10
Figure 10. Figure 10: Attention-module spectral alignment in vision models. Singular-vector similarity for query, key, value and output projections shows the same pattern as FFN layers: leading components remain highly stable after finetuning. Figures 10 and 11 show that attention modules exhibit a highly consistent pattern with FFN layers. Across query, key, value and output projections, the leading singular vectors remain st… view at source ↗
Figure 11
Figure 11. Figure 11: Attention-module spectral alignment in language models. Across query, key, and value projections, leading singular vectors remain strongly aligned after finetuning, with alignment decreasing toward lower-ranked components, consistent with FFN and vision-model observations. view at source ↗
Figure 12
Figure 12. Figure 12: Pretrained-to-finetuned singular-vector replacement (FFN). Replacing finetuned singular vectors with pretrained ones causes only minor performance degradation across tasks, especially for leading components. Panels: (A) COLA, (B) MNLI, (C) QNLI, (D) SST2. view at source ↗
Figure 13
Figure 13. Figure 13: Cross-task singular-vector swapping (FFN). Swapping singular vectors between models finetuned on different tasks has limited impact on performance in most cases, indicating strong cross-task compatibility in spectral directions. Panels: (A) COLA, (B) MNLI, (C) QNLI, (D) SST2. view at source ↗
Figure 14
Figure 14. Figure 14: Pretrained-to-finetuned singular-vector replacement (attention). Attention projections show similar robustness: replacing singular vectors leads to only modest performance changes, particularly in leading components. Panels: (A) COLA, (B) MNLI, (C) QNLI, (D) SST2. view at source ↗
Figure 15
Figure 15. Figure 15: Cross-task singular-vector swapping (attention). Performance remains largely stable under cross-task swapping, suggesting that attention modules also share a compatible spectral basis across tasks. Across both module types and all tasks, swapping singular vectors leads to only limited performance changes. This supports a unified view: finetuning primarily operates by reweighting spectral coefficients wit… view at source ↗
Figure 16
Figure 16. Figure 16: Cumulative explained variance of ConvNeXt last-block FC weights. The first and second columns correspond to fc1 and fc2, respectively. (A) Explained variance versus relative rank. About 60% of singular values are needed to capture 90% of the Frobenius energy, with fc1 spectra slightly more concentrated than fc2. (B) Explained variance of the top singular values in absolute rank. The top-30 components expl… view at source ↗
Figure 17
Figure 17. Figure 17: Masking order comparison across tasks and module types. Performance under top-down masking (from leading ranks) versus bottom-up masking (from trailing ranks) on four datasets. Results are shown for both feed-forward (FC) and attention modules. view at source ↗
Figure 18
Figure 18. Figure 18: Singular vector similarity versus generalization performance across layers and rank ranges. (A) Vision models. (B) Language models. Across both modalities, no consistent or statistically significant relationship between singular vector similarity and generalization performance is observed after multiple testing correction. view at source ↗
Figure 19
Figure 19. Figure 19: Multi-rank spectral alignment analysis. Smaller singular value ratios correlate with higher vector similarity. Alignment decreases significantly at higher ranks, with rank-100 showing maximum similarity below 0.1. view at source ↗
Figure 20
Figure 20. Figure 20: Cumulative-gradient analysis across pretrained models. Gradient singular values are consistently smaller than pretrained parameter singular values across all models, indicating small-magnitude finetuning perturbations. view at source ↗
Figure 21
Figure 21. Figure 21: Additional results across more pretraining data scales. (A) Post-training on the same task with different data. (B) Post-training on the same data source with different tasks. In both settings, models pretrained on larger datasets exhibit more stable singular vectors during post-training. view at source ↗
read the original abstract

Finetuning pretrained models occurs in a low-dimensional subspace of the full parameter space. Prior work has focused on characterizing this optimization subspace, but largely ignored the complementary question: why do certain directions remain unexplored during finetuning? Are these stable directions irrelevant to downstream tasks, or do they already encode task-relevant structure that requires no further adjustment? Answering this question is central to understanding how pretrained knowledge transfers. Through systematic spectral analysis across vision and language models, we show that the leading singular vectors of pretrained weight matrices remain highly stable under finetuning and are shared across unrelated downstream tasks, revealing that pretraining establishes a reusable spectral coordinate system. Models pretrained on larger datasets exhibit greater spectral stability under distribution shift or task change, directly linking pretraining scale to geometric transferability. Motivated by these findings, we propose a parameter-efficient method that freezes pretrained singular vectors and optimizes only leading spectral coefficients, achieving competitive performance on GLUE with 0.2% trainable parameters. Our results reveal that the stable directions encode transferable structure rather than irrelevant noise: successful pretraining discovers spectral bases that downstream tasks inherit and operate within.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that pretraining induces stable leading singular vectors in weight matrices that remain largely unchanged during finetuning and are shared across unrelated downstream tasks, establishing a reusable spectral coordinate system. This is supported by spectral analysis across vision and language models showing greater stability with larger pretraining datasets, and motivates a parameter-efficient finetuning method that freezes these vectors while optimizing only leading spectral coefficients, achieving competitive GLUE performance with 0.2% trainable parameters.

Significance. If the central empirical observations hold after addressing controls, the work provides a geometric interpretation of transfer learning that links pretraining scale directly to the dimensionality and reusability of the adaptation subspace. The proposed spectral-coefficient approach offers a simple, interpretable alternative to existing PEFT methods and could inform both theory and practice in efficient model adaptation.

major comments (2)
  1. [Abstract] The claim that pretraining 'establishes a reusable spectral coordinate system' and that stable directions 'encode transferable structure rather than irrelevant noise' rests on comparisons across pretraining scales but lacks a from-scratch baseline in which identical architectures are trained from random initialization on the same downstream tasks. Without this control, the observed stability could arise from generic SGD dynamics or SVD geometry rather than pretraining-induced structure, leaving the causal attribution insecure.
  2. [Abstract / Method description] The success of freezing singular vectors and tuning only coefficients (0.2% parameters on GLUE) is reported as evidence for the reusable basis, yet the manuscript does not include ablations isolating whether performance depends on the pretrained singular vectors specifically or would hold for any fixed low-rank basis; this weakens the link between the spectral analysis and the method's efficacy.
minor comments (2)
  1. Details on the exact SVD implementation, number of layers and models analyzed, and any statistical tests for stability (e.g., cosine similarity thresholds or variance across runs) are needed to assess reproducibility of the spectral measurements.
  2. The definition of 'unrelated downstream tasks' and quantitative metrics for cross-task sharing of singular vectors should be stated explicitly to allow readers to evaluate the sharing claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the causal role of pretraining in our spectral analysis. We address each major comment below and plan to revise the manuscript with additional experiments to strengthen our claims.

read point-by-point responses
  1. Referee: The claim that pretraining 'establishes a reusable spectral coordinate system' and that stable directions 'encode transferable structure rather than irrelevant noise' rests on comparisons across pretraining scales but lacks a from-scratch baseline in which identical architectures are trained from random initialization on the same downstream tasks. Without this control, the observed stability could arise from generic SGD dynamics or SVD geometry rather than pretraining-induced structure, leaving the causal attribution insecure.

    Authors: We agree that a from-scratch baseline would strengthen the causal attribution to pretraining. In the revised manuscript we will add experiments training identical architectures from random initialization on the downstream tasks and directly compare spectral stability of leading singular vectors under finetuning. This control will help distinguish pretraining-induced structure from generic SGD or SVD effects. revision: yes

  2. Referee: The success of freezing singular vectors and tuning only coefficients (0.2% parameters on GLUE) is reported as evidence for the reusable basis, yet the manuscript does not include ablations isolating whether performance depends on the pretrained singular vectors specifically or would hold for any fixed low-rank basis; this weakens the link between the spectral analysis and the method's efficacy.

    Authors: We acknowledge the value of such ablations. The revision will include comparisons of the proposed method against versions that freeze random orthogonal bases or bases extracted from scratch-trained models. We expect performance to degrade with non-pretrained bases, thereby tightening the connection between the observed spectral stability and the method's effectiveness. revision: yes
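A minimal sketch of how such an ablation could be wired up (the additive-update parameterization and zero-initialized coefficients are assumptions, not details from the paper): build the frozen basis either from the pretrained SVD or from a random orthonormal basis of matching shape, then train only the coefficients under each choice and compare downstream scores.

```python
import torch

def random_orthonormal(rows: int, cols: int, seed: int) -> torch.Tensor:
    """Random matrix with orthonormal columns (requires cols <= rows)."""
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(rows, cols, generator=g))
    return Q

def make_frozen_basis(W_pretrained: torch.Tensor, top_k: int, use_pretrained: bool):
    """Frozen (U, Vh) for an additive update W_eff = W_pretrained + U @ diag(c) @ Vh,
    where only the coefficient vector c is trained and starts at zero in both arms."""
    if use_pretrained:
        U, _, Vh = torch.linalg.svd(W_pretrained, full_matrices=False)
        return U[:, :top_k].clone(), Vh[:top_k].clone()
    out_dim, in_dim = W_pretrained.shape
    return random_orthonormal(out_dim, top_k, seed=0), random_orthonormal(in_dim, top_k, seed=1).T
```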

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical SVD measurements independent of the proposed method

full rationale

The paper performs systematic empirical spectral analysis on pretrained weight matrices before and after finetuning, measuring stability of leading singular vectors and their sharing across tasks. These observations directly motivate but do not define or derive the parameter-efficient method of freezing singular vectors and optimizing coefficients. No equations, self-citations, or ansatzes reduce the stability findings or transferability claims to fitted quantities from the same data by construction. The derivation chain is observational and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical spectral observations; the analysis presupposes that SVD of weight matrices isolates the directions most relevant to transfer.

axioms (1)
  • domain assumption Singular value decomposition of weight matrices isolates the principal geometric directions relevant to task adaptation and transfer
    The entire stability analysis and the proposed method are built on treating leading singular vectors as the meaningful coordinate system.
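For background, the standard linear-algebra fact this assumption leans on is the Eckart–Young theorem: the truncated SVD is the best low-rank approximation of a weight matrix in Frobenius norm, so the leading singular vectors are the directions that carry most of its energy. A minimal statement (background, not a result of the paper):

```latex
% Background: the truncated SVD is the best rank-k approximation (Eckart--Young).
\[
  W = U \Sigma V^{\top}, \qquad
  W_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^{\top}, \qquad
  \|W - W_k\|_F \;=\; \min_{\operatorname{rank}(B)\le k} \|W - B\|_F
  \;=\; \Big(\sum_{i>k} \sigma_i^{2}\Big)^{1/2}.
\]
```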

pith-pipeline@v0.9.0 · 5507 in / 1298 out tokens · 69852 ms · 2026-05-11T01:35:30.862343+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors

  1. [1]

    Llm fine-tuning: Concepts, opportunities, and challenges. Big Data and Cognitive Computing, 9(4):87, 2025

    Xiao-Kun Wu, Min Chen, Wanyi Li, Rui Wang, Limeng Lu, Jia Liu, Kai Hwang, Yixue Hao, Yanru Pan, Qingguo Meng, et al. Llm fine-tuning: Concepts, opportunities, and challenges. Big Data and Cognitive Computing, 9(4):87, 2025

  2. [2]

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024

  3. [3]

    Intrinsic dimensionality explains the effectiveness of language model fine-tuning

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 7319–7328, 2021

  4. [4]

    Fine-tuning happens in tiny subspaces: Exploring intrinsic task-specific subspaces of pre-trained language models

    Zhong Zhang, Bang Liu, and Junming Shao. Fine-tuning happens in tiny subspaces: Exploring intrinsic task-specific subspaces of pre-trained language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1701–1713, 2023

  5. [5]

    Few dimensions are enough: Fine-tuning bert with selected dimensions revealed its redundant nature. arXiv preprint arXiv:2504.04966, 2025

    Shion Fukuhata and Yoshinobu Kano. Few dimensions are enough: Fine-tuning bert with selected dimensions revealed its redundant nature. arXiv preprint arXiv:2504.04966, 2025

  6. [6]

    Pissa: Principal singular values and singular vectors adaptation of large language models. Advances in Neural Information Processing Systems, 37:121038–121072, 2024

    Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. Advances in Neural Information Processing Systems, 37:121038–121072, 2024

  7. [7]

    What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020

    Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020

  8. [8]

    How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014

  9. [9]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

  10. [10]

    Measuring the intrinsic dimension of objective landscapes

    Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018

  11. [11]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

  12. [12]

    Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  13. [13]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022

  14. [14]

    AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023

  15. [15]

    Dora: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024

  16. [16]

    Towards a unified view of parameter-efficient transfer learning

    Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021

  17. [17]

    Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021

    Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021

  18. [18]

    Traditional and heavy tailed self regularization in neural network models

    Michael Mahoney and Charles Martin. Traditional and heavy tailed self regularization in neural network models. In International Conference on Machine Learning, pages 4284–4293. PMLR, 2019

  19. [19]

    Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30, 2017

    Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30, 2017

  20. [20]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  21. [21]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  22. [22]

    Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358–50376, 2023

    Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358–50376, 2023

  23. [23]

    Scaling laws for transfer

    Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021

  24. [24]

    Adapterfusion: Non-destructive task composition for transfer learning

    Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pages 487–503, 2021

  25. [25]

    Svft: Parameter-efficient fine-tuning with singular vectors. Advances in Neural Information Processing Systems, 37:41425–41446, 2024

    Vijay Lingam, Atula Tejaswi, Aditya Vavre, Aneesh Shetty, Gautham K Gudur, Joydeep Ghosh, Alex Dimakis, Eunsol Choi, Aleksandar Bojchevski, and Sujay Sanghavi. Svft: Parameter-efficient fine-tuning with singular vectors. Advances in Neural Information Processing Systems, 37:41425–41446, 2024

  26. [26]

    & Lu, B.-L

    Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous eeg data in bci.arXiv preprint arXiv:2405.18765, 2024

  27. [27]

    Open multi-session and multi-task eeg cognitive dataset for passive brain-computer interface applications. Scientific Data, 10(1):85, 2023

    Marcel F Hinss, Emilie S Jahanpour, Bertille Somon, Lou Pluchon, Frédéric Dehais, and Raphaëlle N Roy. Open multi-session and multi-task eeg cognitive dataset for passive brain-computer interface applications. Scientific Data, 10(1):85, 2023

  28. [28]

    Brainlm: A foundation model for brain activity recordings

    Josue Ortega Caro, Antonio Henrique de Oliveira Fonseca, Syed A Rizvi, Matteo Rosati, Christopher Averill, James L Cross, Prateek Mittal, Emanuele Zappala, Rahul Madhav Dhodapkar, Chadi Abdallah, et al. Brainlm: A foundation model for brain activity recordings. In The Twelfth International Conference on Learning Representations

  29. [29]

    Omni-fmri: A universal atlas-free fmri foundation model

    Mo Wang, Wenhao Ye, Junfeng Xia, Junxiang Zhang, Xuanye Pan, Minghao Xu, Haotian Deng, Hongkai Wen, and Quanying Liu. Omni-fmri: A universal atlas-free fmri foundation model. arXiv preprint arXiv:2601.23090, 2026

  30. [30]

    Brain-jepa: Brain dynamics foundation model with gradient positioning and spatiotemporal masking. Advances in Neural Information Processing Systems, 37:86048–86073, 2024

    Zijian Dong, Ruilin Li, Yilei Wu, Thuan T Nguyen, Joanna S Chong, Fang Ji, Nathanael R Tong, Christopher L Chen, and Juan H Zhou. Brain-jepa: Brain dynamics foundation model with gradient positioning and spatiotemporal masking. Advances in Neural Information Processing Systems, 37:86048–86073, 2024

  31. [31]

    Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining

    Junfeng Xia, Wenhao Ye, Xuanye Pan, Xinke Shen, Mo Wang, and Quanying Liu. Brain-dit: A universal multi-state fmri foundation model with metadata-conditioned pretraining. arXiv preprint arXiv:2604.12683, 2026