Batch Effects In Brain Foundation Model Embeddings

Anand D. Sarwate; Bradley T. Baker; Sandeep Panta; Sergey Plis; Vince D. Calhoun; Ye Tao; Yu Wu

arxiv: 2604.14441 · v1 · submitted 2026-04-15 · 📡 eess.SP

Pith reviewed 2026-05-10 12:10 UTC · model grok-4.3

classification 📡 eess.SP

keywords batch effects · foundation models · fMRI · neuroimaging embeddings · harmonization · BrainLM · SwiFT

The pith

Foundation model embeddings from brain scans encode substantial batch effects that often dominate diagnosis signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates embeddings generated by two neuroimaging foundation models, BrainLM and SwiFT, on multiple heterogeneous fMRI datasets collected at different sites. It finds that these embeddings carry strong variability tied to acquisition batches, and that this batch-related information frequently exceeds the signals relevant to clinical diagnoses. The authors further test how standard harmonization methods affect the embeddings and observe that the two models differ in what they represent: BrainLM emphasizes fine-grained local brain activity while SwiFT emphasizes interactions across regions. If these results hold, direct use of such embeddings for diagnosis or group comparisons across sites risks attributing site differences to biology.

Core claim

Foundation model embeddings encode substantial batch-related variability, often dominating diagnosis-related information across heterogeneous datasets. Harmonization reduces these batch effects, while the models themselves differ in representational focus consistent with their architectures: BrainLM prefers fine-grained regional activity and SwiFT prefers interactions between regions.

What carries the argument

A systematic evaluation framework that quantifies batch effects versus diagnosis-related information in the embeddings produced by BrainLM and SwiFT on multi-site fMRI data.
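
One common way to compare how much site and diagnosis information an embedding carries is to train the same simple classifier on each label and compare held-out accuracy. The sketch below is only an illustration on synthetic data, not the paper's framework: the nearest-centroid probe, the function names, and the toy generator (strong site shift, weak diagnosis shift) are all assumptions.

```python
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Classify each test row by the closest class mean from the training rows."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test row to every class centroid.
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_test).mean())

def make_toy_embeddings(n=600, dim=32, n_sites=3, seed=0):
    """Synthetic embeddings (hypothetical): large per-site offset, small diagnosis offset."""
    rng = np.random.default_rng(seed)
    site = rng.integers(0, n_sites, size=n)
    diagnosis = rng.integers(0, 2, size=n)
    site_offsets = rng.normal(0.0, 2.0, size=(n_sites, dim))  # strong batch shift
    diag_offset = rng.normal(0.0, 0.1, size=dim)              # weak diagnosis shift
    X = rng.normal(size=(n, dim)) + site_offsets[site] + diagnosis[:, None] * diag_offset
    return X, site, diagnosis

X, site, diagnosis = make_toy_embeddings()
half = len(X) // 2  # first half for fitting centroids, second half held out
site_acc = nearest_centroid_accuracy(X[:half], site[:half], X[half:], site[half:])
diag_acc = nearest_centroid_accuracy(X[:half], diagnosis[:half], X[half:], diagnosis[half:])
print(f"site accuracy: {site_acc:.2f}, diagnosis accuracy: {diag_acc:.2f}")
```

With real embeddings, the same probe should be run with subjects (not scans) split between the two halves, and the two accuracies compared against their respective chance levels, since site and diagnosis usually have different numbers of classes.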

If this is right

Harmonization techniques can measurably reduce the dominance of batch effects in these embeddings.
BrainLM embeddings are better suited for analyses focused on regional activity patterns, while SwiFT embeddings suit analyses of inter-regional interactions.
Disentangling acquisition variability from biological signals is required before using the embeddings for cross-site clinical or research applications.
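
To make the harmonization point concrete, the sketch below applies a plain per-site location and scale adjustment to a toy embedding matrix. It is deliberately simpler than ComBat (no empirical-Bayes shrinkage, no preserved covariates); the function name `standardize_per_site` and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def standardize_per_site(X, site):
    """Align every site's per-feature mean and standard deviation.

    Plain location-scale adjustment, NOT ComBat: no empirical-Bayes
    shrinkage and no protection of biological covariates.
    """
    X = np.asarray(X, dtype=float)
    site = np.asarray(site)
    sites = np.unique(site)
    grand_mean = X.mean(axis=0)
    # Pooled within-site variance, weighted by the number of rows per site.
    within_var = sum((site == s).sum() * X[site == s].var(axis=0) for s in sites) / len(X)
    pooled_sd = np.sqrt(within_var)
    out = np.empty_like(X)
    for s in sites:
        rows = site == s
        sd = X[rows].std(axis=0)
        sd = np.where(sd > 0, sd, 1.0)  # guard against constant features
        out[rows] = (X[rows] - X[rows].mean(axis=0)) / sd * pooled_sd + grand_mean
    return out

def site_mean_spread(X, site):
    """Largest gap between site means over all features."""
    means = np.stack([X[site == s].mean(axis=0) for s in np.unique(site)])
    return float((means.max(axis=0) - means.min(axis=0)).max())

rng = np.random.default_rng(0)
site = np.repeat(np.arange(3), 100)
X = rng.normal(size=(300, 8)) + 5.0 * site[:, None]  # toy data with a site shift
X_h = standardize_per_site(X, site)
raw_spread = site_mean_spread(X, site)
harmonized_spread = site_mean_spread(X_h, site)
```

A caveat on this design: if diagnosis prevalence differs across sites, removing site means also removes part of the diagnosis signal, which is one reason covariate-aware methods are usually preferred in practice.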

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Downstream machine-learning tasks trained on these embeddings may inadvertently learn site-specific artifacts rather than true diagnostic features unless batch correction is applied first.
Similar batch dominance could appear in foundation models trained on other biomedical imaging modalities or non-imaging data collected across institutions.
Retraining or fine-tuning the foundation models on larger, explicitly harmonized datasets might reduce the observed batch sensitivity.

Load-bearing premise

The chosen metrics and evaluation framework isolate batch effects from biologically meaningful signals without themselves introducing or amplifying site-specific artifacts.

What would settle it

A direct comparison showing that, on held-out multi-site data, similarity between embeddings from the same subject scanned at different sites exceeds similarity between different subjects with the same diagnosis, even after harmonization.
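
The comparison described above can be sketched as two averages of pairwise cosine similarity. The record layout (`subject`, `site`, `diagnosis`, `emb`) and the four toy embeddings are hypothetical; a real test would need subjects who were actually scanned at more than one site.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pair_similarity(records, keep_pair):
    """Average cosine similarity over all unordered pairs accepted by keep_pair."""
    sims = [cosine(a["emb"], b["emb"])
            for i, a in enumerate(records) for b in records[i + 1:]
            if keep_pair(a, b)]
    return sum(sims) / len(sims) if sims else float("nan")

def same_subject_cross_site(a, b):
    return a["subject"] == b["subject"] and a["site"] != b["site"]

def same_diagnosis_other_subject(a, b):
    return a["subject"] != b["subject"] and a["diagnosis"] == b["diagnosis"]

# Toy records (hypothetical): two subjects, each scanned at sites A and B.
records = [
    {"subject": "s1", "site": "A", "diagnosis": 1, "emb": [1.0, 0.1, 0.0]},
    {"subject": "s1", "site": "B", "diagnosis": 1, "emb": [0.9, 0.2, 0.1]},
    {"subject": "s2", "site": "A", "diagnosis": 1, "emb": [0.1, 1.0, 0.0]},
    {"subject": "s2", "site": "B", "diagnosis": 1, "emb": [0.2, 0.9, 0.1]},
]
within_subject = mean_pair_similarity(records, same_subject_cross_site)
within_diagnosis = mean_pair_similarity(records, same_diagnosis_other_subject)
```

In this toy case `within_subject` exceeds `within_diagnosis`, which is the pattern one would hope to see if subject identity survives a change of site; embeddings dominated by batch effects would tend to show the opposite ordering.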

Figures

Figures reproduced from arXiv: 2604.14441 by Anand D. Sarwate, Bradley T. Baker, Sandeep Panta, Sergey Plis, Vince D. Calhoun, Ye Tao, Yu Wu.

**Figure 1.** Overview of the fMRI representation pipeline using foundation models. fMRI scans are encoded into low-dimensional embeddings by foundation models. These embeddings are used for dimensionality reduction (e.g., PCA, LDA) and predictive modeling.

**Figure 4.** Visualization of subject-level embeddings extracted from the pre-trained SwiFT model. In the last row, embeddings are first projected using PCA with 20 components, followed by further dimensionality reduction using LDA. All points are colored according to site identity. For the ABIDE I dataset, which includes 17 imaging sites, only the 10 sites with the largest sample sizes are shown for clarity.

**Figure 5.** Visualization of subject-level FNC features. In the last row, features are first projected using PCA with 20 components, followed by further dimensionality reduction using LDA. All points are colored according to site identity. For the ABIDE I dataset, which includes 17 imaging sites, only the 10 sites with the largest sample sizes are shown for clarity.

**Figure 6.** PCA visualization of subject-level representations colored by diagnostic labels. Top, middle, and bottom rows correspond to FBIRN, ADHD-200, and ABIDE I, respectively. Columns represent different feature types: FNC, BrainLM embeddings, and SwiFT embeddings. This comparison illustrates how the various representations separate diagnostic groups across datasets.

**Figure 7.** Spatial maps of ALFF decoding performance (R²) after ComBat harmonization. Only the top 30% predictive regions are shown for visualization.

**Figure 8.** Network-level mean R² of FNC decoding after ComBat harmonization. Bars show the average predictive performance within each functional network for BrainLM and SwiFT embeddings.
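
The captions of Figures 7 and 8 refer to decoding ALFF and FNC from the embeddings and scoring the result with R². The paper's regression model is not given in the text available here, so the sketch below assumes closed-form ridge regression and synthetic targets; `ridge_fit`, `r2_per_target`, and the toy data are illustrative names and values only.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W

def r2_per_target(Y_true, Y_pred):
    """Coefficient of determination computed separately for each target column."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
emb = rng.normal(size=(400, 16))        # toy subject-level embeddings
W_true = rng.normal(size=(16, 5))
targets = emb @ W_true + 0.5 * rng.normal(size=(400, 5))  # stand-in for regional features
targets[:, -1] = rng.normal(size=400)   # last target is unrelated to the embeddings

# Fit on the first 300 subjects, score on the remaining 100.
W, b = ridge_fit(emb[:300], targets[:300])
r2 = r2_per_target(targets[300:], emb[300:] @ W + b)
```

Each entry of `r2` plays the role of one region (or one connectivity edge); averaging entries within a functional network would give a network-level summary of the kind Figure 8 describes. Held-out R² can be negative for targets the embeddings do not predict.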

Original abstract

Foundation models show strong potential for large-scale, high-dimensional biomedical applications, yet their ability to capture relevant neurobiological characteristics remains underexplored. We systematically evaluate embeddings from two neuroimaging foundation models, BrainLM and SwiFT, across multi-site fMRI datasets using a comprehensive evaluation framework. Our results show that foundation model embeddings encode substantial batch-related variability, often dominating diagnosis-related information across heterogeneous datasets. We further investigate how harmonization, applied to reduce batch effects, influences these embeddings. In addition, we find that BrainLM prefers to capture fine-grained regional activity, whereas SwiFT tends to represent interactions between regions, consistent with their respective model architectures. Our study highlights the importance of accounting for batch effects in foundation models and motivates future work on disentangling biologically meaningful signals from acquisition-related variability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Batch effects swamp diagnosis signals in BrainLM and SwiFT embeddings on multi-site fMRI, and standard harmonization only partially fixes it.

The core finding is that embeddings from BrainLM and SwiFT pick up substantial site-specific batch variability that often outweighs the diagnostic information across the datasets they examined. They also note that BrainLM leans toward fine-grained regional patterns while SwiFT picks up more inter-region interactions, which lines up with how the models are built. Harmonization reduces but does not remove the batch component. That is the practical takeaway for anyone planning to use these models on pooled neuroimaging data.

The work does a clean job of running the same evaluation framework on two recent foundation models and showing the batch problem is not solved just by moving to larger pre-trained networks. The architectural contrast is a small but useful observation that follows directly from the model designs. The evaluation is systematic enough to make the point that batch effects remain a deployment issue.

The soft spots are mostly about missing quantitative anchors. The abstract and available text give no effect sizes, no clear statistical tests for the dominance claim, and limited detail on how they ruled out confounds like age or sex imbalance across sites. Without those numbers it is hard to judge how large the problem actually is or whether the harmonization results are robust. The paper is incremental rather than foundational; batch-effect concerns have been standard in fMRI for a long time, so the contribution is mainly the application to these two models.

This is useful reading for groups that train or fine-tune neuroimaging foundation models and for anyone running multi-site studies. It is not a breakthrough but it flags a real obstacle that needs attention before clinical translation. A serious editor should send it to referees with a request for the missing quantitative controls and clearer metrics; the underlying concern is legitimate and the experiments are straightforward to check.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates embeddings from two neuroimaging foundation models (BrainLM and SwiFT) on multi-site fMRI datasets via a comprehensive evaluation framework. It claims that these embeddings encode substantial batch-related variability that often dominates diagnosis-related information, examines the impact of harmonization techniques on the embeddings, and reports that BrainLM preferentially captures fine-grained regional activity while SwiFT represents inter-regional interactions, consistent with their architectures.

Significance. If the empirical findings hold under rigorous controls, the work is significant for highlighting a practical limitation in applying foundation models to heterogeneous biomedical imaging data. It provides concrete motivation for improved harmonization and disentanglement methods in neuroimaging ML, and the multi-site evaluation adds real-world relevance. The observation of architecture-aligned differences between models is a useful secondary contribution.

major comments (2)

[Abstract and Evaluation Framework] The central claim that batch effects are 'often dominating diagnosis-related information' requires explicit quantification (e.g., via specific metrics, effect sizes, or statistical tests comparing batch vs. diagnosis variance). Without these details the dominance assertion cannot be verified as load-bearing for the conclusions.
[Methods / Evaluation Framework] The evaluation framework's ability to isolate batch effects from biological signals is load-bearing, yet the manuscript provides no description of controls for confounding variables (age, sex, or site demographics) or validation that the framework itself does not amplify site-specific artifacts. This directly affects the weakest assumption in the study.
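
As an illustration of the kind of quantification the first major comment asks for, the sketch below computes a one-way eta-squared (between-group variance over total variance) for each embedding dimension, once with site labels and once with diagnosis labels. This is a simple stand-in chosen for brevity, not a method taken from the paper; the function name and synthetic data are assumptions.

```python
import numpy as np

def eta_squared(X, labels):
    """Fraction of each feature's total variance explained by group means."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand = X.mean(axis=0)
    ss_total = ((X - grand) ** 2).sum(axis=0)
    ss_between = np.zeros(X.shape[1])
    for g in np.unique(labels):
        rows = labels == g
        # Group size times squared deviation of the group mean from the grand mean.
        ss_between += rows.sum() * (X[rows].mean(axis=0) - grand) ** 2
    return ss_between / ss_total

rng = np.random.default_rng(1)
site = np.repeat(np.arange(4), 50)   # 4 toy sites, 50 subjects each
diagnosis = np.tile([0, 1], 100)     # balanced diagnosis within every site
X = rng.normal(size=(200, 10)) + 2.0 * site[:, None] + 0.2 * diagnosis[:, None]

eta_site = float(eta_squared(X, site).mean())
eta_diag = float(eta_squared(X, diagnosis).mean())
print(f"mean eta^2: site={eta_site:.3f}, diagnosis={eta_diag:.3f}")
```

These are marginal quantities: when diagnosis prevalence differs across sites, the two values share variance and no longer separate cleanly, which is where a joint model with both factors becomes necessary.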

minor comments (1)

[Abstract] The abstract would benefit from at least one concrete quantitative result (e.g., a reported R², AUC difference, or variance ratio) to ground the qualitative claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas where the manuscript can be strengthened with additional quantification and methodological details. We address each major comment below and will make the corresponding revisions.

Referee: [Abstract and Evaluation Framework] The central claim that batch effects are 'often dominating diagnosis-related information' requires explicit quantification (e.g., via specific metrics, effect sizes, or statistical tests comparing batch vs. diagnosis variance). Without these details the dominance assertion cannot be verified as load-bearing for the conclusions.

Authors: We agree that the dominance claim requires explicit quantification to be verifiable. The revised manuscript will add variance decomposition analyses (e.g., using linear mixed-effects models with site as a random effect and diagnosis as a fixed effect) to report the proportion of variance attributable to batch versus diagnosis, along with effect sizes and statistical tests. These results will be presented in a new subsection of the Results.

Revision: yes

Referee: [Methods / Evaluation Framework] The evaluation framework's ability to isolate batch effects from biological signals is load-bearing, yet the manuscript provides no description of controls for confounding variables (age, sex, or site demographics) or validation that the framework itself does not amplify site-specific artifacts. This directly affects the weakest assumption in the study.

Authors: We acknowledge that the current manuscript lacks explicit description of confounder controls and validation steps. In the revision, the Methods section will be expanded to detail how age, sex, and site demographics are accounted for (via covariate regression or matching) and to include validation checks, such as residual correlation analyses and ablation tests on harmonized versus unharmonized subsets to confirm the framework does not amplify artifacts.

Revision: yes
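
The covariate regression mentioned in this response could look roughly like the sketch below, which removes a least-squares linear fit of age and sex from every embedding dimension. The function name `residualize`, the covariate choice, and the synthetic data are assumptions; nothing here is taken from the manuscript.

```python
import numpy as np

def residualize(X, covariates):
    """Remove the least-squares linear fit of the covariates from each column of X."""
    X = np.asarray(X, dtype=float)
    # Design matrix: intercept column followed by the covariate columns.
    C = np.column_stack([np.ones(len(X)), np.asarray(covariates, dtype=float)])
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)
    return X - C @ beta

rng = np.random.default_rng(2)
age = rng.uniform(8, 60, size=250)   # toy ages in years
sex = rng.integers(0, 2, size=250)   # toy binary sex coding
X = rng.normal(size=(250, 6)) + 0.1 * age[:, None] + 0.5 * sex[:, None]

X_res = residualize(X, np.column_stack([age, sex]))
# After residualization, every dimension is linearly uncorrelated with age.
max_corr = max(abs(np.corrcoef(age, X_res[:, j])[0, 1]) for j in range(X_res.shape[1]))
```

In a predictive setting the coefficients should be estimated on the training subjects only and then applied to held-out subjects; fitting on the full dataset, as this toy example does, would leak information across the split.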

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical evaluation of foundation model embeddings on multi-site fMRI datasets, with claims resting on direct metric comparisons and observed patterns rather than any derivation chain, equations, or self-referential definitions. No load-bearing steps reduce by construction to fitted inputs or prior self-citations; the abstract and described framework treat batch effects and diagnosis signals as independently measurable quantities without renaming or smuggling assumptions. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on standard assumptions about fMRI data quality and the validity of the chosen evaluation metrics.

pith-pipeline@v0.9.0 · 5447 in / 1089 out tokens · 36996 ms · 2026-05-10T12:10:42.587382+00:00 · methodology
