arxiv: 2606.04010 · v1 · pith:2WEMH72Ynew · submitted 2026-05-29 · 🧬 q-bio.NC · cs.AI

The Variance Brain Foundation Models Forgot: Third-Order Statistics Predict Cognition Where Billion-Parameter Models Fail

Pith reviewed 2026-06-28 19:05 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.AI
keywords: brain foundation models, fMRI, functional connectivity, co-skewness, third-order statistics, cognitive prediction, pretraining objective, variance allocation

The pith

Brain foundation models predict cognition worse than linear functional connectivity because pretraining destroys third-order co-skewness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that three leading brain foundation models, trained self-supervised on fMRI, underperform a simple linear regression on the functional connectivity matrix when predicting cognitive scores. The performance gap increases with model scale, pointing to the pretraining objective as the cause rather than architecture or size. Analysis of the reconstructed signals shows that second-order covariance is retained while third-order co-skewness is largely lost. A linear projection of the fMRI signal onto the subspace that preserves co-skewness yields functional connectivity features that exceed both raw FC and all tested BFMs across datasets, without any pretraining. Finetuning a BFM with a loss aimed at the same subspace recovers performance to the raw-FC level.
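
A minimal sketch of the raw-FC baseline described above. It assumes Pearson correlation as the FC estimate and a ridge penalty solved in dual form; the figure captions mention KRR, but the kernel, penalty strength, and preprocessing used here are illustrative assumptions rather than the paper's settings. The names `fc_features` and `ridge_fit_predict` are hypothetical, and the data are synthetic noise.

```python
import numpy as np

def fc_features(ts):
    """Vectorise the strict upper triangle of the Pearson FC matrix.

    ts: array of shape (T, P), T time points by P parcels.
    """
    fc = np.corrcoef(ts, rowvar=False)       # (P, P) functional connectivity
    iu = np.triu_indices(fc.shape[0], k=1)   # indices above the diagonal
    return fc[iu]

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Linear ridge regression solved in its dual (linear-kernel) form."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    K = Xc @ Xc.T                            # (n, n) linear kernel
    dual = np.linalg.solve(K + alpha * np.eye(len(K)), y_train - y_mean)
    return (X_test - x_mean) @ Xc.T @ dual + y_mean

rng = np.random.default_rng(0)
n_subjects, T, P = 20, 120, 400              # synthetic sizes, assumed
X = np.stack([fc_features(rng.standard_normal((T, P)))
              for _ in range(n_subjects)])
y = rng.standard_normal(n_subjects)          # synthetic "cognitive scores"
pred = ridge_fit_predict(X[:15], y[:15], X[15:], alpha=10.0)
print(X.shape)  # (20, 79800)
```

With 400 parcels the upper triangle holds 400 × 399 / 2 = 79,800 values, which is in line with the roughly 80K FC parameters the abstract quotes if a 400-parcel atlas is assumed.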

Core claim

Brain foundation models pretrained on fMRI capture the dominant variance components but discard the co-skewness tensor that carries cognitive information. Per-cumulant comparison of original and reconstructed signals confirms partial preservation of second-order statistics alongside near-total loss of third-order structure. Projecting the signal into the co-skewness-preserving subspace and computing functional connectivity in that space produces predictions that surpass both the raw connectivity matrix and every pretrained model tested.
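
The per-cumulant comparison can be illustrated with a toy example. The sketch below is not the paper's analysis: it uses a relative Frobenius-norm error as one possible metric, and it replaces a real BFM reconstruction with a Gaussian surrogate that matches the original covariance. All names (`covariance`, `coskewness`, `relative_error`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def covariance(ts):
    """Second-order cumulant: (P, P) covariance of a (T, P) signal."""
    z = ts - ts.mean(axis=0)
    return z.T @ z / len(z)

def coskewness(ts):
    """Third-order cumulant: (P, P, P) tensor with entries E[z_i z_j z_k]."""
    z = ts - ts.mean(axis=0)
    return np.einsum("ti,tj,tk->ijk", z, z, z) / len(z)

def relative_error(reference, estimate):
    """Frobenius-norm error of `estimate` relative to `reference`."""
    return np.linalg.norm(reference - estimate) / np.linalg.norm(reference)

rng = np.random.default_rng(0)
T, P = 20000, 8
mixing = rng.standard_normal((P, P))
# Skewed toy "original": mixed, centred exponential sources.
original = (rng.exponential(size=(T, P)) - 1.0) @ mixing
# Toy "reconstruction": Gaussian surrogate with the original's covariance,
# standing in for a model that keeps second-order structure only.
chol = np.linalg.cholesky(covariance(original))
reconstruction = rng.standard_normal((T, P)) @ chol.T

cov_err = relative_error(covariance(original), covariance(reconstruction))
skew_err = relative_error(coskewness(original), coskewness(reconstruction))
print(f"covariance error {cov_err:.2f}, co-skewness error {skew_err:.2f}")
```

In this toy setting the covariance error stays small while the co-skewness error is close to 1, which is the qualitative pattern the paper reports for BFM reconstructions.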

What carries the argument

The co-skewness-preserving subspace projection, which selects directions in the fMRI signal that retain third-order moments before computing functional connectivity.
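
One way to picture such a projection is a truncated higher-order SVD of the co-skewness tensor. The figure captions refer to a Tucker decomposition, but the exact algorithm, the rank, and whether the tensor is pooled across subjects are not stated here, so the sketch below is an assumption-laden illustration with hypothetical function names. Because the co-skewness tensor is fully symmetric, all of its mode unfoldings coincide and a single factor matrix suffices.

```python
import numpy as np

def coskewness_subspace(ts, rank):
    """Orthonormal (P, rank) basis from the unfolded co-skewness tensor."""
    z = ts - ts.mean(axis=0)
    T, P = z.shape
    C = np.einsum("ti,tj,tk->ijk", z, z, z) / T   # (P, P, P), O(P^3) memory
    unfolded = C.reshape(P, P * P)                # mode-1 unfolding
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return U[:, :rank]

def fc_in_subspace(ts, basis):
    """Pearson FC between the projected components."""
    projected = (ts - ts.mean(axis=0)) @ basis    # (T, rank)
    return np.corrcoef(projected, rowvar=False)

# Toy data: three skewed sources embedded in ten channels plus Gaussian noise.
rng = np.random.default_rng(0)
T, P, R = 20000, 10, 3
Q, _ = np.linalg.qr(rng.standard_normal((P, P)))
A = 2.0 * Q[:, :R].T                              # (R, P) embedding
ts = (rng.exponential(size=(T, R)) - 1.0) @ A + 0.5 * rng.standard_normal((T, P))

basis = coskewness_subspace(ts, rank=R)
fc = fc_in_subspace(ts, basis)
# Fraction of the embedding that falls outside the recovered subspace.
residual = np.linalg.norm(A - A @ basis @ basis.T) / np.linalg.norm(A)
print(fc.shape, round(float(residual), 3))
```

The Gaussian noise contributes no third-order cumulant in expectation, so the recovered basis aligns with the skewed sources rather than with the directions of largest variance. For a 400-parcel atlas the dense tensor has 64 million entries, so a practical implementation would likely need a more memory-aware formulation.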

If this is right

  • Larger BFMs can predict cognition worse than smaller ones on the same readouts, as observed for BrainLM-650M versus BrainLM-111M.
  • Finetuning any BFM with a loss that targets the co-skewness subspace recovers performance up to the raw-FC ceiling.
  • The performance limit in current BFMs is set by the pretraining objective, not parameter count or Transformer architecture.
  • A linear pipeline using the co-skewness subspace outperforms prior state-of-the-art cognitive prediction methods on every tested dataset and parcellation.
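
The finetuning idea in the second bullet can be sketched as a moment-matching penalty evaluated inside a fixed subspace. The form below is hypothetical: the source only says the loss is targeted at the co-skewness subspace (one figure caption calls it dual-moment finetuning), so the choice of squared errors, the `weight` parameter, and the function names are assumptions. It is written in NumPy to show the computation; an actual training loop would need an autodiff framework.

```python
import numpy as np

def moments_in_subspace(ts, basis):
    """Covariance and co-skewness of a (T, P) signal projected onto `basis`."""
    z = (ts - ts.mean(axis=0)) @ basis
    cov = z.T @ z / len(z)
    skew = np.einsum("ti,tj,tk->ijk", z, z, z) / len(z)
    return cov, skew

def dual_moment_loss(original, reconstructed, basis, weight=1.0):
    """Mean squared mismatch of second- and third-order moments (hypothetical)."""
    cov_o, skew_o = moments_in_subspace(original, basis)
    cov_r, skew_r = moments_in_subspace(reconstructed, basis)
    return np.mean((cov_o - cov_r) ** 2) + weight * np.mean((skew_o - skew_r) ** 2)

rng = np.random.default_rng(0)
T, P, R = 500, 6, 3
basis = np.eye(P)[:, :R]                      # placeholder orthonormal basis
original = rng.exponential(size=(T, P))       # skewed toy signal
reconstructed = rng.standard_normal((T, P))   # toy model output

loss_same = dual_moment_loss(original, original, basis)
loss_diff = dual_moment_loss(original, reconstructed, basis)
print(loss_same, loss_diff > 0)
```

The penalty is zero when the reconstruction reproduces the original exactly and grows as either moment drifts inside the subspace, which is the behaviour such an auxiliary term would need.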

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future self-supervised objectives for fMRI should explicitly regularize or reconstruct higher-order moments rather than variance alone.
  • The result raises the possibility that cognitive signals in brain data live primarily in non-Gaussian structure, which standard variance-based pretraining discards by design.
  • Similar variance-allocation problems may appear in other high-dimensional time-series domains where second-order statistics dominate the training loss.

Load-bearing premise

The subspace that best preserves co-skewness is the one that also holds the information relevant for predicting cognition, and the per-cumulant breakdown reflects genuine loss rather than selection artifacts.

What would settle it

A controlled test in which BFMs are finetuned with an explicit co-skewness preservation loss yet still fail to match the linear subspace method on held-out cognitive prediction tasks would falsify the claim that the pretraining objective is the decisive bottleneck.

Figures

Figures reproduced from arXiv: 2606.04010 by Demian Wassermann, Gabriel Mahuas, Giovanni Marraffini, Trinidad Borrell, Victoria Shevchenko.

Figure 1: Feature-extraction pipelines and comparison protocol. All methods share the KRR + …
Figure 2: Cognition prediction (r, mean ± 1 std across 200 CV folds) on AOMIC and HCP. All three self-supervised BFMs (left of dashed line) sit at or below the noise floor, with BrainLM-650M < BrainLM-111M (inverse scaling). KRR on raw FC exceeds every BFM by a wide margin, and the Tucker decomposition of the co-skewness tensor (right) further improves it on both datasets.
Figure 3: Temporal reduction sweeps across the four dataset …
Figure 4: BrainLM dual-moment FT (AOMIC-trained at …
Figure 5: Readout comparison after finetuning on AOMIC: BrainLM-111M (top) and BrainLM-650M …
Figure 6: Readout comparison after finetuning on HCP: BrainLM-111M (top) and BrainLM-650M …
Figure 7: HCP Schaefer granularity comparison. FC-full at each native Schaefer resolution ( …
Figure 8: Learning curves across all four dataset×parcellation combinations. Top: AOMIC AAL-424 (left), AOMIC Schaefer-400 (right). Bottom: HCP AAL-424 (left), HCP Schaefer-400 (right). Shaded bands are ±1 std across the 200 CV folds.
Figure 9: Dense-sweep curves for all four dataset×parcellation cells: AOMIC AAL-424 (top left), AOMIC Schaefer-400 (top right), HCP AAL-424 (bottom left), HCP Schaefer-400 (bottom right). FC-full shown as a horizontal dashed line. FC-Tucker (blue) exceeds FC-full across a broad plateau (R ≳ 80 up to full rank), not just at the sweep-optimal R∗; FC-PCA (orange) is at or below FC-full over the corresponding range. …
Figure 10: Pretrained BrainLM readouts on AOMIC (AAL-424): 111M (top) and 650M (bottom).
Figure 11: Pretrained Brain-JEPA (top, Schaefer-400+Tian-50) and BrainMass (bottom, Schaefer-100 …
Figure 12: Pretrained BrainLM readouts on HCP (AAL-424): 111M (top) and 650M (bottom).
Figure 13: Pretrained Brain-JEPA (top) and BrainMass (bottom) readouts on HCP. Same atlas-per …
Figure 14: Ooi-style raw FC baseline plots on AOMIC (left) and HCP (right). Used throughout the …
Figure 15: Reconstruction-quality comparison. FC computed from BFM-reconstructed timeseries …
Figure 16: Subject fingerprinting on HCP. Raw FC separates within- from between-subject distances …
Original abstract

Brain foundation models (BFMs) are self-supervised Transformers pretrained on fMRI data. We posit that these models should capture each subject's cognitive performance from their fMRI signal. Yet across three state-of-the-art BFMs and every readout we test, they predict cognition worse than a linear regression from the ~80K parameters of the functional connectivity matrix (FC). The gap widens with scale: BrainLM's 650M model predicts cognition worse than its 111M. We attribute this to a variance allocation problem: BFM pretraining captures the variance components that dominate fMRI but not the higher-order structure that predicts cognition. Our per-cumulant analysis of the reconstructed signal shows that the second-order covariance is partially preserved, while the third-order co-skewness tensor is largely destroyed. To recover what BFMs lose, we design a linear pipeline that projects the fMRI signal into the subspace that best preserves its co-skewness and computes FC there. This exceeds raw FC and every pretrained BFM on every dataset and parcellation we test, outperforming prior state-of-the-art under controlled evaluation with no pretraining and no GPU. We recover the raw-FC ceiling on BrainLM's forward pass by finetuning with a loss targeted at this same subspace. This shows that the bottleneck is the pretraining objective, not the architecture or the model size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that brain foundation models (BFMs) underperform linear regression on the ~80K parameters of the functional connectivity (FC) matrix when predicting cognition from fMRI, with the performance gap increasing with model scale. It attributes this to a variance allocation problem in which BFM pretraining preserves second-order covariance but largely destroys third-order co-skewness. The authors introduce a linear pipeline that projects fMRI signals into the subspace best preserving co-skewness before computing FC; this exceeds both raw FC and all tested BFMs. They further show that finetuning a BFM with a loss targeted at the same subspace recovers the raw-FC performance ceiling, concluding that the pretraining objective—not architecture or scale—is the bottleneck.

Significance. If the results are free of selection effects, the work would be significant for demonstrating that higher-order statistics, rather than model capacity, limit BFM utility for cognitive prediction. The per-cumulant analysis of reconstructions and the targeted finetuning experiment provide concrete evidence linking a specific statistical loss to downstream failure. The proposed linear method offers a reproducible, GPU-free baseline that outperforms large pretrained models, which could shift emphasis toward objective design in neuroimaging foundation models.

major comments (2)
  1. [Methods (linear pipeline and subspace identification)] Methods section describing the co-skewness-preserving subspace: the procedure for identifying the subspace (via tensor decomposition or optimization) must be shown to be performed strictly unsupervised on data held out from the cognition regression task. The abstract and per-cumulant analysis do not specify whether the 'best preserves' criterion is computed on the same fMRI sessions later used for label prediction or whether any post-hoc metric correlates with cognition scores; if either occurs, the reported gains over raw FC could arise from implicit label leakage rather than recovery of destroyed third-order structure.
  2. [Results (per-cumulant analysis)] Results on per-cumulant analysis of BFM reconstructions: the claim that co-skewness is 'largely destroyed' while covariance is 'partially preserved' requires explicit quantification (e.g., Frobenius norms or explained variance per cumulant order) on the exact same held-out sessions used for cognition prediction. Without these numbers and without confirming independence from label information, it is unclear whether the observed performance ordering is driven by the claimed statistical loss.
minor comments (2)
  1. [Abstract] The abstract states that the subspace projection 'exceeds raw FC and every pretrained BFM on every dataset and parcellation'; please add a table or supplementary figure reporting exact effect sizes and statistical tests for each comparison.
  2. [Methods] Notation for the co-skewness tensor and its projection operator should be defined once in the main text with consistent symbols across equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications and quantifications.

Point-by-point responses
  1. Referee: [Methods (linear pipeline and subspace identification)] Methods section describing the co-skewness-preserving subspace: the procedure for identifying the subspace (via tensor decomposition or optimization) must be shown to be performed strictly unsupervised on data held out from the cognition regression task. The abstract and per-cumulant analysis do not specify whether the 'best preserves' criterion is computed on the same fMRI sessions later used for label prediction or whether any post-hoc metric correlates with cognition scores; if either occurs, the reported gains over raw FC could arise from implicit label leakage rather than recovery of destroyed third-order structure.

    Authors: The subspace is identified via an unsupervised optimization (or tensor decomposition) performed exclusively on fMRI time series from the training portion of each cross-validation fold; cognition labels are never accessed during this step. The preservation criterion is computed solely from the third-order cumulant of the training data. We will expand the Methods section with pseudocode and an explicit statement confirming that no label information enters the subspace selection, thereby ruling out leakage. revision: yes

  2. Referee: [Results (per-cumulant analysis)] Results on per-cumulant analysis of BFM reconstructions: the claim that co-skewness is 'largely destroyed' while covariance is 'partially preserved' requires explicit quantification (e.g., Frobenius norms or explained variance per cumulant order) on the exact same held-out sessions used for cognition prediction. Without these numbers and without confirming independence from label information, it is unclear whether the observed performance ordering is driven by the claimed statistical loss.

    Authors: We agree that explicit numerical quantification on the identical held-out sessions is required. We will add a supplementary table (or figure panel) reporting the Frobenius-norm ratios and explained-variance percentages for both the second-order covariance and third-order co-skewness tensors, computed on the test folds used for the cognition regressions. These metrics are derived from the reconstruction step alone and are therefore independent of the downstream labels. revision: yes
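
The fold-wise, label-free subspace fitting described in the first response can be sketched as follows. This is an illustration under stated assumptions, not the authors' pseudocode: the co-skewness tensor is averaged over training subjects, the basis comes from a truncated SVD of its unfolding, and the rank, fold count, ridge penalty, and function names are all hypothetical. The data are synthetic noise, so the predictions carry no signal.

```python
import numpy as np

def fit_basis(train_subjects, rank):
    """Subspace from the mean co-skewness of training time series only."""
    acc = 0.0
    for ts in train_subjects:
        z = ts - ts.mean(axis=0)
        acc = acc + np.einsum("ti,tj,tk->ijk", z, z, z) / len(z)
    C = acc / len(train_subjects)
    P = C.shape[0]
    U, _, _ = np.linalg.svd(C.reshape(P, P * P), full_matrices=False)
    return U[:, :rank]

def features(ts, basis):
    """Upper triangle of the FC matrix computed in the projected space."""
    fc = np.corrcoef((ts - ts.mean(axis=0)) @ basis, rowvar=False)
    return fc[np.triu_indices(basis.shape[1], k=1)]

def ridge_predict(X_tr, y_tr, X_te, alpha):
    """Primal ridge regression with centring on the training fold."""
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
    gram = (X_tr - mu_x).T @ (X_tr - mu_x) + alpha * np.eye(X_tr.shape[1])
    w = np.linalg.solve(gram, (X_tr - mu_x).T @ (y_tr - mu_y))
    return (X_te - mu_x) @ w + mu_y

rng = np.random.default_rng(0)
n = 30
subjects = [rng.standard_normal((200, 12)) for _ in range(n)]  # synthetic
scores = rng.standard_normal(n)                                # synthetic labels
folds = np.array_split(rng.permutation(n), 5)

predictions = np.empty(n)
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    # The basis sees training time series only: no labels, no test subjects.
    basis = fit_basis([subjects[i] for i in train_idx], rank=4)
    X_tr = np.stack([features(subjects[i], basis) for i in train_idx])
    X_te = np.stack([features(subjects[i], basis) for i in test_idx])
    predictions[test_idx] = ridge_predict(X_tr, scores[train_idx], X_te, alpha=1.0)
```

Refitting the basis inside every fold keeps the test subjects out of the subspace estimate, which is the leakage concern the referee raises.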

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation rests on three independent empirical steps: (1) direct comparison of BFM readouts vs. linear regression on raw FC parameters, (2) per-cumulant decomposition of BFM reconstructions showing differential preservation of covariance vs. co-skewness, and (3) construction of a linear projection whose sole selection criterion is maximization of co-skewness preservation on the fMRI signal itself, followed by FC computation and cognition regression on that projection. The finetuning experiment targets the same co-skewness subspace with an auxiliary loss and shows recovery of the raw-FC performance ceiling. None of these steps reduces by construction to the downstream cognition labels; the subspace criterion is defined solely from third-order moments of the input signal without reference to labels, and all performance numbers are reported under controlled evaluation. No self-citation chain or uniqueness theorem is invoked to force the result. The chain is therefore self-contained against the external benchmarks of BFM outputs and raw FC regression.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper's approach relies on the existence and importance of third-order statistics in fMRI for cognition, which is assumed rather than derived from first principles. The subspace selection is data-dependent.

free parameters (1)
  • co-skewness preserving subspace
    The subspace is chosen to best preserve the co-skewness tensor, which is data-dependent.
axioms (1)
  • domain assumption: Third-order co-skewness in fMRI signals is predictive of cognitive performance
    This is central to why preserving it improves prediction.


