Recognition: no theorem link
Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation
Pith reviewed 2026-05-11 01:35 UTC · model grok-4.3
The pith
Pretraining creates a reusable spectral basis of stable leading singular vectors that finetuning inherits across tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pretraining induces a reusable spectral basis formed by the leading singular vectors of the weight matrices. These vectors remain highly stable under finetuning and are shared across unrelated downstream tasks in vision and language settings. Models pretrained on larger datasets display greater spectral stability under distribution shift or task change. The stable directions encode transferable task-relevant structure, as shown by the fact that optimizing only the associated spectral coefficients, with the basis vectors frozen, yields competitive performance on GLUE with only 0.2 percent of parameters trainable.
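A minimal sketch of what coefficient-only adaptation could look like, assuming a PyTorch linear layer; the rank, the additive offset parameterization, and the class name are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralCoefficientLinear(nn.Module):
    """Freeze the pretrained singular vectors of a linear layer and train only
    additive offsets to its leading singular values (illustrative sketch)."""

    def __init__(self, pretrained_weight: torch.Tensor, rank: int = 8):
        super().__init__()
        # SVD of the pretrained weight: W = U diag(S) V^T
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U[:, :rank])    # frozen left singular vectors
        self.register_buffer("S", S[:rank])       # frozen leading singular values
        self.register_buffer("Vh", Vh[:rank, :])  # frozen right singular vectors
        # Frozen remainder of W outside the leading-rank component
        self.register_buffer("W_rest", pretrained_weight - self.U @ torch.diag(self.S) @ self.Vh)
        # The only trainable parameters: offsets to the leading spectral coefficients
        self.delta = nn.Parameter(torch.zeros(rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.U @ torch.diag(self.S + self.delta) @ self.Vh + self.W_rest
        return F.linear(x, W)
```

The frozen remainder term keeps the non-leading part of the pretrained weight intact, so only the leading spectral coefficients move during adaptation.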
What carries the argument
The leading singular vectors of pretrained weight matrices, which act as a fixed spectral basis that downstream tasks reuse by adjusting only the corresponding coefficients.
If this is right
- Finetuning reduces to low-dimensional optimization along the stable spectral directions.
- Pretraining scale directly increases the geometric transferability of the learned basis.
- Freezing the singular vectors and tuning only coefficients offers a parameter-efficient adaptation route that works across vision and language models.
- The same spectral directions serve multiple unrelated tasks without further adjustment.
Where Pith is reading between the lines
- The approach could extend to other model families if similar spectral stability appears in their weight matrices.
- Pretraining losses might be modified to encourage even stronger reuse of leading singular directions.
- Task similarity could be measured by overlap in the leading singular vectors each task activates.
- The finding suggests that some directions learned during pretraining can be left fixed during finetuning because they already support broad transfer.
Load-bearing premise
The stability of the leading singular vectors under finetuning shows they carry transferable task-relevant structure rather than emerging as an artifact of optimization dynamics or matrix initialization.
What would settle it
Finding that the leading singular vectors change substantially during finetuning on new tasks, or that unrelated tasks rely on entirely different leading vectors, would disprove the reusable-basis claim.
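One way to run that test, sketched here under the assumption that a layer's pretrained and finetuned weights are available as PyTorch tensors: per-vector cosine similarity between corresponding leading singular vectors. Values near 1 across tasks would support the reusable-basis claim; values near 0 would undercut it. The function name and the choice of k are illustrative.

```python
import torch

def leading_singular_vector_stability(W_pre: torch.Tensor, W_ft: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Cosine similarity |u_i . u'_i| between the k leading left singular vectors
    of a pretrained weight matrix and its finetuned counterpart (illustrative sketch)."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    # The sign of each singular vector is arbitrary, so take absolute values.
    return (U_pre[:, :k] * U_ft[:, :k]).sum(dim=0).abs()  # shape (k,); values near 1 mean stable
```

Because near-degenerate singular values can swap order during finetuning, a subspace-level overlap, like the cross-task metric sketched later, is a more forgiving companion check.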
Original abstract
Finetuning pretrained models occurs in a low-dimensional subspace of the full parameter space. Prior work has focused on characterizing this optimization subspace, but largely ignored the complementary question: why do certain directions remain unexplored during finetuning? Are these stable directions irrelevant to downstream tasks, or do they already encode task-relevant structure that requires no further adjustment? Answering this question is central to understanding how pretrained knowledge transfers. Through systematic spectral analysis across vision and language models, we show that the leading singular vectors of pretrained weight matrices remain highly stable under finetuning and are shared across unrelated downstream tasks, revealing that pretraining establishes a reusable spectral coordinate system. Models pretrained on larger datasets exhibit greater spectral stability under distribution shift or task change, directly linking pretraining scale to geometric transferability. Motivated by these findings, we propose a parameter-efficient method that freezes pretrained singular vectors and optimizes only leading spectral coefficients, achieving competitive performance on GLUE with 0.2% trainable parameters. Our results reveal that the stable directions encode transferable structure rather than irrelevant noise: successful pretraining discovers spectral bases that downstream tasks inherit and operate within.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pretraining induces stable leading singular vectors in weight matrices that remain largely unchanged during finetuning and are shared across unrelated downstream tasks, establishing a reusable spectral coordinate system. This is supported by spectral analysis across vision and language models showing greater stability with larger pretraining datasets, and motivates a parameter-efficient finetuning method that freezes these vectors while optimizing only leading spectral coefficients, achieving competitive GLUE performance with 0.2% trainable parameters.
Significance. If the central empirical observations hold after addressing controls, the work provides a geometric interpretation of transfer learning that links pretraining scale directly to the dimensionality and reusability of the adaptation subspace. The proposed spectral-coefficient approach offers a simple, interpretable alternative to existing PEFT methods and could inform both theory and practice in efficient model adaptation.
Major comments (2)
- Abstract: The claim that pretraining 'establishes a reusable spectral coordinate system' and that stable directions 'encode transferable structure rather than irrelevant noise' rests on comparisons across pretraining scales but lacks a from-scratch baseline in which identical architectures are trained from random initialization on the same downstream tasks. Without this control, the observed stability could arise from generic SGD dynamics or SVD geometry rather than pretraining-induced structure, leaving the causal attribution insecure.
- Abstract / Method description: The success of freezing singular vectors and tuning only coefficients (0.2% parameters on GLUE) is reported as evidence for the reusable basis, yet the manuscript does not include ablations isolating whether performance depends on the pretrained singular vectors specifically or would hold for any fixed low-rank basis; this weakens the link between the spectral analysis and the method's efficacy.
Minor comments (2)
- Details on the exact SVD implementation, number of layers and models analyzed, and any statistical tests for stability (e.g., cosine similarity thresholds or variance across runs) are needed to assess reproducibility of the spectral measurements.
- The definition of 'unrelated downstream tasks' and quantitative metrics for cross-task sharing of singular vectors should be stated explicitly to allow readers to evaluate the sharing claim.
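One candidate for such a cross-task metric, sketched under the assumption that the same layer's weights are available from two checkpoints finetuned on unrelated tasks: the mean squared cosine of the principal angles between their leading-k left singular subspaces. The names and the choice of k are illustrative, not the paper's stated protocol.

```python
import torch

def cross_task_subspace_overlap(W_task_a: torch.Tensor, W_task_b: torch.Tensor, k: int = 16) -> float:
    """Overlap between the leading-k left singular subspaces of one layer's weight
    after finetuning on two different tasks (illustrative sketch)."""
    U_a, _, _ = torch.linalg.svd(W_task_a, full_matrices=False)
    U_b, _, _ = torch.linalg.svd(W_task_b, full_matrices=False)
    M = U_a[:, :k].T @ U_b[:, :k]      # k x k inner products between the two bases
    return (M ** 2).sum().item() / k   # mean squared principal-angle cosine
```

For unrelated random k-dimensional subspaces in a d-dimensional space this statistic concentrates near k/d, so values well above that baseline would quantify cross-task sharing.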
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the causal role of pretraining in our spectral analysis. We address each major comment below and plan to revise the manuscript with additional experiments to strengthen our claims.
Point-by-point responses
- Referee: The claim that pretraining 'establishes a reusable spectral coordinate system' and that stable directions 'encode transferable structure rather than irrelevant noise' rests on comparisons across pretraining scales but lacks a from-scratch baseline in which identical architectures are trained from random initialization on the same downstream tasks. Without this control, the observed stability could arise from generic SGD dynamics or SVD geometry rather than pretraining-induced structure, leaving the causal attribution insecure.
  Authors: We agree that a from-scratch baseline would strengthen the causal attribution to pretraining. In the revised manuscript we will add experiments training identical architectures from random initialization on the downstream tasks and directly compare spectral stability of leading singular vectors under finetuning. This control will help distinguish pretraining-induced structure from generic SGD or SVD effects. Revision: yes.
- Referee: The success of freezing singular vectors and tuning only coefficients (0.2% parameters on GLUE) is reported as evidence for the reusable basis, yet the manuscript does not include ablations isolating whether performance depends on the pretrained singular vectors specifically or would hold for any fixed low-rank basis; this weakens the link between the spectral analysis and the method's efficacy.
  Authors: We acknowledge the value of such ablations. The revision will include comparisons of the proposed method against versions that freeze random orthogonal bases or bases extracted from scratch-trained models. We expect performance to degrade with non-pretrained bases, thereby tightening the connection between the observed spectral stability and the method's effectiveness. Revision: yes.
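A minimal sketch of one such control, assuming the coefficient-only parameterization sketched earlier is kept and only the frozen basis is swapped: a random orthonormal basis of matching rank from a QR decomposition, carrying no pretrained structure. Names and shapes here are illustrative.

```python
import torch

def random_orthonormal_basis(dim: int, rank: int, seed: int = 0) -> torch.Tensor:
    """Random orthonormal columns of shape (dim, rank), used as an ablation baseline
    in place of the pretrained singular vectors (illustrative sketch)."""
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(dim, rank, generator=g))  # Q has orthonormal columns
    return Q

# Hypothetical ablation: adapt coefficients over a random basis instead of the pretrained one.
# out_dim, in_dim, r = W_pretrained.shape[0], W_pretrained.shape[1], 8
# U_rand = random_orthonormal_basis(out_dim, r)
# V_rand = random_orthonormal_basis(in_dim, r, seed=1)
# W_adapted = W_pretrained + U_rand @ torch.diag(coeffs) @ V_rand.T  # coeffs: the only trainable vector
```

If GLUE performance with such a basis matched the pretrained-basis variant, the reusable-basis interpretation would be weakened; a clear gap would support it.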
Circularity Check
No circularity: claims rest on direct empirical SVD measurements independent of the proposed method
Full rationale
The paper performs systematic empirical spectral analysis on pretrained weight matrices before and after finetuning, measuring stability of leading singular vectors and their sharing across tasks. These observations directly motivate but do not define or derive the parameter-efficient method of freezing singular vectors and optimizing coefficients. No equations, self-citations, or ansatzes reduce the stability findings or transferability claims to fitted quantities from the same data by construction. The chain of reasoning is observational, and the method is evaluated against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Singular value decomposition of weight matrices isolates the principal geometric directions relevant to task adaptation and transfer.