Batch Effects In Brain Foundation Model Embeddings
Pith reviewed 2026-05-10 12:10 UTC · model grok-4.3
The pith
Foundation model embeddings from brain scans encode substantial batch effects that often dominate diagnosis signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Foundation model embeddings encode substantial batch-related variability, often dominating diagnosis-related information across heterogeneous datasets. Harmonization reduces these batch effects, while the models themselves differ in representational focus consistent with their architectures: BrainLM prefers fine-grained regional activity and SwiFT prefers interactions between regions.
What carries the argument
A systematic evaluation framework that quantifies batch effects versus diagnosis-related information in the embeddings produced by BrainLM and SwiFT on multi-site fMRI data.
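As a rough illustration of what "quantifying batch effects versus diagnosis-related information" can look like, the sketch below fits the same simple probe twice on a set of embeddings, once to predict site and once to predict diagnosis, and compares the accuracies. The nearest-centroid probe, the function names, and the toy data are all assumptions made for illustration. The summary does not say which metrics the paper's framework uses.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def loo_nearest_centroid_accuracy(embeddings, labels):
    """Leave-one-out accuracy of a nearest-centroid probe for `labels`."""
    correct = 0
    for i, (x, y) in enumerate(zip(embeddings, labels)):
        groups = defaultdict(list)
        for j, (v, lab) in enumerate(zip(embeddings, labels)):
            if j != i:  # hold out sample i
                groups[lab].append(v)
        centroids = {lab: centroid(vs) for lab, vs in groups.items()}
        pred = min(centroids, key=lambda lab: math.dist(x, centroids[lab]))
        correct += pred == y
    return correct / len(embeddings)


# Hypothetical toy data: dimension 0 tracks site strongly,
# dimension 1 tracks diagnosis only weakly.
embeddings = [
    [0.0, 0.1], [0.2, 0.9], [0.1, 0.2], [0.3, 0.8],  # site A
    [5.0, 0.3], [5.2, 0.7], [5.1, 0.4], [5.3, 0.6],  # site B
]
site = ["A"] * 4 + ["B"] * 4
diagnosis = ["ctl", "pat", "ctl", "pat"] * 2

site_acc = loo_nearest_centroid_accuracy(embeddings, site)
dx_acc = loo_nearest_centroid_accuracy(embeddings, diagnosis)
print(f"site probe: {site_acc:.2f}, diagnosis probe: {dx_acc:.2f}")
```

If the site probe clearly outperforms the diagnosis probe, as it does on this toy data, batch information is easier to read out of the embeddings than diagnosis.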
If this is right
- Harmonization techniques can measurably reduce the dominance of batch effects in these embeddings.
- BrainLM embeddings are better suited for analyses focused on regional activity patterns, while SwiFT embeddings suit analyses of inter-regional interactions.
- Disentangling acquisition variability from biological signals is required before using the embeddings for cross-site clinical or research applications.
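To make the first point concrete, here is a minimal sketch of the simplest kind of harmonization: z-scoring each embedding dimension within each site so that all sites share the same mean and spread. This is a deliberately naive stand-in and is not claimed to be the method used in the paper. The function name and toy data are hypothetical. Note that plain per-site standardization can also erase real biological differences when the diagnosis mix varies across sites.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_site(embeddings, sites):
    """Z-score every embedding dimension within each site."""
    by_site = defaultdict(list)
    for idx, s in enumerate(sites):
        by_site[s].append(idx)

    out = [list(v) for v in embeddings]
    n_dims = len(embeddings[0])
    for idxs in by_site.values():
        for d in range(n_dims):
            values = [embeddings[i][d] for i in idxs]
            mu, sd = mean(values), pstdev(values)
            for i in idxs:
                # Guard against features that are constant within a site.
                out[i][d] = (embeddings[i][d] - mu) / sd if sd > 0 else 0.0
    return out


# Hypothetical toy data: site B is shifted and rescaled relative to site A.
embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
sites = ["A", "A", "B", "B"]
harmonized = standardize_per_site(embeddings, sites)
print(harmonized)
```

After this step the two sites occupy the same range, so a probe can no longer separate them by location or scale alone.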
Where Pith is reading between the lines
- Downstream machine-learning tasks trained on these embeddings may inadvertently learn site-specific artifacts rather than true diagnostic features unless batch correction is applied first.
- Similar batch dominance could appear in foundation models trained on other biomedical imaging modalities or non-imaging data collected across institutions.
- Retraining or fine-tuning the foundation models on larger, explicitly harmonized datasets might reduce the observed batch sensitivity.
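One common way to check whether a downstream model has learned site-specific artifacts is to evaluate it on a site it never saw during training. The sketch below only generates such leave-one-site-out splits. The function name and site labels are hypothetical, and training the downstream model is left to the reader.

```python
def leave_one_site_out(sites):
    """Yield (held_out_site, train_indices, test_indices) for each site."""
    for held_out in sorted(set(sites)):
        train = [i for i, s in enumerate(sites) if s != held_out]
        test = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train, test


# Hypothetical site labels, one per scan.
sites = ["A", "A", "B", "C", "B", "C"]
splits = list(leave_one_site_out(sites))
for held_out, train, test in splits:
    print(held_out, train, test)
```

A large gap between within-site and held-out-site performance would suggest the model is relying on acquisition-related cues.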
Load-bearing premise
The chosen metrics and evaluation framework isolate batch effects from biologically meaningful signals without themselves introducing or amplifying site-specific artifacts.
What would settle it
A direct comparison showing that, on held-out multi-site data, similarity between embeddings from the same subject scanned at different sites exceeds similarity between different subjects with the same diagnosis, even after harmonization.
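The comparison described above can be sketched as follows, assuming a dataset in which some subjects were scanned at more than one site. The record layout, the choice of cosine similarity, and the toy values are assumptions for illustration. The text above does not fix a similarity measure or say which pairs to include.

```python
import math
from itertools import combinations
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def compare_similarities(records):
    """records: list of (subject_id, site, diagnosis, embedding) tuples.

    Returns (mean same-subject cross-site similarity,
             mean different-subject same-diagnosis similarity).
    """
    same_subject, same_dx = [], []
    for (s1, site1, dx1, e1), (s2, site2, dx2, e2) in combinations(records, 2):
        if s1 == s2 and site1 != site2:
            same_subject.append(cosine(e1, e2))
        elif s1 != s2 and dx1 == dx2:
            same_dx.append(cosine(e1, e2))
    return mean(same_subject), mean(same_dx)


# Hypothetical toy records: two subjects, each scanned at sites A and B.
records = [
    ("s1", "A", "pat", [1.0, 0.0]),
    ("s1", "B", "pat", [0.9, 0.1]),
    ("s2", "A", "pat", [0.0, 1.0]),
    ("s2", "B", "pat", [0.1, 0.9]),
]
within, across = compare_similarities(records)
print(f"same subject, different site: {within:.3f}")
print(f"different subject, same diagnosis: {across:.3f}")
```

On real data, the same two numbers would be computed before and after harmonization and compared.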
Original abstract
Foundation models show strong potential for large-scale, high-dimensional biomedical applications, yet their ability to capture relevant neurobiological characteristics remains underexplored. We systematically evaluate embeddings from two neuroimaging foundation models, BrainLM and SwiFT, across multi-site fMRI datasets using a comprehensive evaluation framework. Our results show that foundation model embeddings encode substantial batch-related variability, often dominating diagnosis-related information across heterogeneous datasets. We further investigate how harmonization, applied to reduce batch effects, influences these embeddings. In addition, we find that BrainLM prefers to capture fine-grained regional activity, whereas SwiFT tends to represent interactions between regions, consistent with their respective model architectures. Our study highlights the importance of accounting for batch effects in foundation models and motivates future work on disentangling biologically meaningful signals from acquisition-related variability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates embeddings from two neuroimaging foundation models (BrainLM and SwiFT) on multi-site fMRI datasets via a comprehensive evaluation framework. It claims that these embeddings encode substantial batch-related variability that often dominates diagnosis-related information, examines the impact of harmonization techniques on the embeddings, and reports that BrainLM preferentially captures fine-grained regional activity while SwiFT represents inter-regional interactions, consistent with their architectures.
Significance. If the empirical findings hold under rigorous controls, the work is significant for highlighting a practical limitation in applying foundation models to heterogeneous biomedical imaging data. It provides concrete motivation for improved harmonization and disentanglement methods in neuroimaging ML, and the multi-site evaluation adds real-world relevance. The observation of architecture-aligned differences between models is a useful secondary contribution.
major comments (2)
- [Abstract and Evaluation Framework] The central claim that batch effects are 'often dominating diagnosis-related information' requires explicit quantification (e.g., via specific metrics, effect sizes, or statistical tests comparing batch vs. diagnosis variance). Without these details, the dominance assertion cannot be verified, even though the conclusions depend on it.
- [Methods / Evaluation Framework] The evaluation framework's ability to isolate batch effects from biological signals is load-bearing, yet the manuscript provides no description of controls for confounding variables (age, sex, or site demographics) or validation that the framework itself does not amplify site-specific artifacts. This directly affects the weakest assumption in the study.
minor comments (1)
- [Abstract] The abstract would benefit from at least one concrete quantitative result (e.g., a reported R², AUC difference, or variance ratio) to ground the qualitative claims.
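One simple way to produce the kind of variance ratio the referee asks for is a sum-of-squares decomposition: for a given grouping (site or diagnosis), compute the share of total embedding variance explained by the group means, pooled over all dimensions. This is a much cruder tool than a formal statistical model, since it handles one factor at a time and ignores overlap between factors. The function name and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def variance_share(embeddings, labels):
    """Share of total sum of squares explained by group means, pooled over dims."""
    between = total = 0.0
    for col in zip(*embeddings):
        grand = mean(col)
        total += sum((v - grand) ** 2 for v in col)
        groups = defaultdict(list)
        for v, lab in zip(col, labels):
            groups[lab].append(v)
        between += sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return between / total if total > 0 else 0.0


# Hypothetical toy data: a large site shift in dimension 0,
# a small diagnosis shift in both dimensions.
embeddings = [[0.0, 0.0], [1.0, 1.0], [10.0, 0.0], [11.0, 1.0]]
site = ["A", "A", "B", "B"]
diagnosis = ["ctl", "pat", "ctl", "pat"]

site_share = variance_share(embeddings, site)
dx_share = variance_share(embeddings, diagnosis)
print(f"variance explained by site: {site_share:.3f}")
print(f"variance explained by diagnosis: {dx_share:.3f}")
```

Reporting the two shares side by side would give the abstract the concrete number it currently lacks.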
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important areas where the manuscript can be strengthened with additional quantification and methodological details. We address each major comment below and will make the corresponding revisions.
Point-by-point responses
- Referee: [Abstract and Evaluation Framework] The central claim that batch effects are 'often dominating diagnosis-related information' requires explicit quantification (e.g., via specific metrics, effect sizes, or statistical tests comparing batch vs. diagnosis variance). Without these details, the dominance assertion cannot be verified, even though the conclusions depend on it.
  Authors: We agree that the dominance claim requires explicit quantification to be verifiable. The revised manuscript will add variance decomposition analyses (e.g., using linear mixed-effects models with site as a random effect and diagnosis as a fixed effect) to report the proportion of variance attributable to batch versus diagnosis, along with effect sizes and statistical tests. These results will be presented in a new subsection of the Results.
  Revision: yes
- Referee: [Methods / Evaluation Framework] The evaluation framework's ability to isolate batch effects from biological signals is load-bearing, yet the manuscript provides no description of controls for confounding variables (age, sex, or site demographics) or validation that the framework itself does not amplify site-specific artifacts. This directly affects the weakest assumption in the study.
  Authors: We acknowledge that the current manuscript lacks explicit description of confounder controls and validation steps. In the revision, the Methods section will be expanded to detail how age, sex, and site demographics are accounted for (via covariate regression or matching) and to include validation checks, such as residual correlation analyses and ablation tests on harmonized versus unharmonized subsets to confirm the framework does not amplify artifacts.
  Revision: yes
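The covariate regression mentioned in the rebuttal can be sketched for the simplest case: removing the linear effect of a single covariate (here, age) from every embedding dimension by ordinary least squares and keeping the residuals. Handling age, sex, and site demographics jointly would need multiple regression, which is omitted for brevity. The function names and toy values are hypothetical, and this is not the authors' implementation.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residualize(embeddings, covariate):
    """Remove the linear effect of one covariate from each embedding dimension."""
    columns = []
    for col in zip(*embeddings):
        slope, intercept = fit_line(covariate, col)
        columns.append(
            [y - (slope * x + intercept) for x, y in zip(covariate, col)]
        )
    return [list(row) for row in zip(*columns)]


# Hypothetical toy data: dimension 0 is fully explained by age,
# dimension 1 is only weakly related to it.
age = [10.0, 12.0, 14.0, 16.0]
embeddings = [[1.0, 5.0], [2.0, 5.5], [3.0, 4.5], [4.0, 5.0]]
residuals = residualize(embeddings, age)
print(residuals)
```

A residual correlation check, also mentioned in the rebuttal, then amounts to confirming that the residuals are uncorrelated with the covariate.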
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical evaluation of foundation model embeddings on multi-site fMRI datasets, with claims resting on direct metric comparisons and observed patterns rather than any derivation chain, equations, or self-referential definitions. No load-bearing steps reduce by construction to fitted inputs or prior self-citations; the abstract and described framework treat batch effects and diagnosis signals as independently measurable quantities without renaming or smuggling assumptions. This is a standard non-circular empirical study.