arxiv: 2605.31504 · v1 · pith:5OALTXXWnew · submitted 2026-05-29 · 💻 cs.LG · stat.ML

When Are Multimodal Predictions Biologically Supported? A Diagnostic Evaluation Framework

Pith reviewed 2026-06-28 22:48 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords multimodal learning · confounding detection · biological interpretability · foundation models · oncology · diagnostic evaluation · representation analysis

The pith

DECAT shows entangled multimodal models falsely claim shared biology in most cases where it is absent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DECAT, a model-agnostic post-hoc framework that classifies multimodal representations, for a given task and modality, into four diagnostic scenarios. These cover shared biology across modalities, biology confined to one modality, and spurious correlations driven by confounders. When the evidence is insufficient, the framework returns indeterminate instead of forcing a label. Classification uses five null-referenced metrics and a rule-based decision procedure. The framework requires no knowledge of which confounder is present and operates directly on learned embeddings. Validation covers more than 2,500 synthetic representations across four model classes and embeddings from 8,979 TCGA patients. Entangled models such as CLIP achieve near-perfect detection of shared biology, yet they falsely claim shared biology in the majority of cases where it is absent, and this false-claim rate rises as confound strength increases. The same pattern appears when the framework is applied to five pretrained pathology foundation models on real patient data without paired RNA, where the confounding remains invisible to standard AUROC evaluation.

Core claim

DECAT classifies multimodal representations into four diagnostic scenarios using five null-referenced metrics and a rule-based procedure. On both synthetic data and real TCGA embeddings, entangled models achieve near-perfect shared-biology detection while falsely claiming shared biology in the majority of cases where it is absent. The false-claim rate increases with confound strength, so larger cohorts and stronger representations yield more confident but still incorrect diagnoses.

What carries the argument

The DECAT framework, a set of five null-referenced metrics plus a rule-based decision procedure that assigns each representation to one of four diagnostic scenarios without requiring confounder labels.
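
As an illustration of what "null-referenced" can mean in practice, the sketch below compares an observed statistic with a threshold taken from a label-permutation null. The 200 permutations and the 97.5th percentile follow the figure captions further down. The statistic itself (`mean_gap`), the function names, and the toy data are assumptions made for this example; they are not DECAT's five metrics.

```python
import random

def mean_gap(scores, labels):
    """Toy statistic: mean score of label-1 samples minus mean score of label-0 samples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def permutation_threshold(scores, labels, stat=mean_gap, n_perm=200, q=0.975, seed=0):
    """Return the q-quantile of `stat` over n_perm random label permutations."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any score/label association, keeps class counts
        null.append(stat(scores, shuffled))
    null.sort()
    return null[min(n_perm - 1, int(q * n_perm))]

def exceeds_null(scores, labels, stat=mean_gap, **kwargs):
    """Return (observed statistic, whether it lies above the permutation threshold)."""
    observed = stat(scores, labels)
    return observed, observed > permutation_threshold(scores, labels, stat=stat, **kwargs)

# Hypothetical toy data: scores perfectly ordered by label.
scores = [i / 100 for i in range(100)]
labels = [0] * 50 + [1] * 50
observed, fired = exceeds_null(scores, labels)
```

Because the threshold is re-estimated from the data being evaluated, no fixed numeric cutoff on the raw statistic has to be carried across tasks; the permutation count and the quantile are the choices that remain.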

If this is right

  • Standard AUROC evaluation cannot distinguish genuine shared biology from confounding in multimodal oncology models.
  • Entangled training objectives increase the rate of false shared-biology claims as dataset size and representation strength grow.
  • The framework can be applied to existing foundation models without paired modalities to surface confounding that performance metrics miss.
  • Models labeled indeterminate by DECAT should not be interpreted as biologically supported for the given task.
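
The first point above can be made concrete with a toy calculation. In the sketch below, a score that only encodes cohort membership (for example, cancer type) reaches a pooled AUROC of 0.75 while carrying no information inside either cohort. The collapse is the pooled AUROC minus the median within-group AUROC, mirroring the definition in the Figure 5 caption. The data and function names are hypothetical.

```python
from collections import defaultdict
from statistics import median

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auroc_collapse(scores, labels, groups):
    """Return (pooled AUROC, median within-group AUROC, their difference)."""
    pooled = auroc(scores, labels)
    by_group = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, groups):
        by_group[g][0].append(s)
        by_group[g][1].append(y)
    within = []
    for group_scores, group_labels in by_group.values():
        value = auroc(group_scores, group_labels)
        if value is not None:
            within.append(value)
    within_median = median(within)
    return pooled, within_median, pooled - within_median

# Hypothetical data: the score separates groups A and B but not labels within a group.
scores = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
pooled, within, delta = auroc_collapse(scores, labels, groups)
```

Here `pooled` is 0.75 and `within` is 0.5, so the apparent signal comes entirely from the difference in label prevalence between the two groups.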

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of multimodal foundation models could run DECAT as a routine post-training check before deploying predictions as biologically grounded.
  • The same metric set might be adapted to other multimodal domains such as imaging-genomics pairs outside oncology.
  • If the rule-based decision thresholds prove stable across cohorts, DECAT could serve as a lightweight filter for selecting representations for downstream biological interpretation.

Load-bearing premise

The five null-referenced metrics and rule-based procedure can reliably separate the four diagnostic scenarios even when the confounder is unknown and the representations come from real patient data with complex confounding.

What would settle it

A dataset in which the true presence or absence of shared biology and the identity of the confounder are known in advance, yet DECAT assigns the wrong diagnostic label to a majority of representations.
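
On such a dataset the test reduces to three rates that the figure captions already define: strict accuracy (correct scenario assigned), conservative accuracy (correct or indeterminate), and the false shared claim rate (Scenario 1 assigned when the truth is not S1). A minimal sketch, assuming scenario labels are plain strings; the example labels are invented for illustration.

```python
def diagnostic_rates(true_labels, predicted_labels, shared="S1", abstain="indeterminate"):
    """Compute strict accuracy, conservative accuracy, and the false shared claim rate."""
    pairs = list(zip(true_labels, predicted_labels))
    n = len(pairs)
    strict = sum(t == p for t, p in pairs) / n
    conservative = sum(t == p or p == abstain for t, p in pairs) / n
    # FSCR is computed only over cases whose true scenario is not shared biology.
    non_shared = [p for t, p in pairs if t != shared]
    fscr = sum(p == shared for p in non_shared) / len(non_shared) if non_shared else float("nan")
    return {"strict": strict, "conservative": conservative, "fscr": fscr}

# Hypothetical ground truth and diagnoses for six representations.
truth = ["S1", "S1", "S2", "S3", "S4", "S4"]
diagnosis = ["S1", "indeterminate", "S1", "S3", "S1", "indeterminate"]
rates = diagnostic_rates(truth, diagnosis)
```

In this made-up example strict accuracy is 2/6, so a majority of labels are not correct, while half of the non-S1 cases are falsely called shared biology.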

Figures

Figures reproduced from arXiv: 2605.31504 by Dylan Steiner, Etai Jacob, Gerald Sun, Gustavo Arango-Argoty.

Figure 1: DECAT framework overview. DECAT takes per-modality embeddings from any multimodal model and uses a four-stage decision tree to classify the predictive behavior of each modality, for a given task, into one of four scenarios or indeterminate (…
Figure 2: DECAT detection rate versus predictive task signal per model on synthetic ground truth. Each column varies the task coefficient for one signal source while all others are zero, producing a pure scenario (S3 has all coefficients at zero). Top row: strict accuracy (correct scenario assigned). Bottom row: conservative accuracy (correct or indeterminate). Where conservative accuracy substantially exceeds stric…
Figure 3: False shared claim rate (FSCR) on synthetic ground truth. FSCR is the probability that DECAT assigns Scenario 1 (shared biology) when the true scenario is not S1, pooled across all non-S1 scenarios (S2B, S3, S4h, S4r). (a) FSCR versus predictive task signal at N_eval = 1000, with α matched by magnitude across non-S1 scenarios. FSCR rises with signal strength for entangled models. (b) FSCR versus evaluation …
Figure 4: DECAT detects TCGA cancer-type confounding invisible to AUROC (CLIP, H&E representation). The α-mixture sweep varies the fraction of Cohort C drawn from a label-extreme pool while A′/B remain random pan-cancer splits; N_C ≈ 300 fixed. Dotted line: mean natural cohort composition across labels (≈0.20; individually TMB ≈0.25, TP53 ≈0.10, Age ≈0.25). TMB and Age are binarized at the pan-cancer median; TP53 is …
Figure 5: DECAT's S2 flag rate predicts within-type AUROC collapse across driver genes (unimodal H&E, Stages II and IV only). (a) Within-type AUROC collapse Δ = AUROC_pan − AUROC_within (median across cancer types) per FM (colored dots) for 16 driver genes sorted by η² (fraction of mutation prevalence variance explained by cancer type). (b) S2 flag rate versus α for KRAS (50 splits). Extreme pool E defined per FM fro…
Figure 6: DECAT decision procedure. Four stages are applied sequentially per task and per modality. Stage I checks structural geometry (task-independent). Stage II gates on signal presence. Stage III localizes signal to shared or modality-specific components (factorized models only). Stage IV evaluates cross-cohort stability via P_transfer and D_task quantile, with Scenario 2 checked first. Terminal nodes are the four…
Figure 7: Representation geometry under independent β sweeps. Each curve varies one β coefficient while holding others at zero; marker size increases with β value. Null thresholds (dashed lines) are the most conservative boundaries across all β conditions (max of per-condition 97.5th percentiles from 200 permutations). Only shared signal (β_s) drives the representation strongly into the aligned-structure region. Moda…
Figure 8: Per-β sensitivity of A_norm and B_norm. Each panel sweeps one β coefficient independently while holding all others at zero. Dashed lines indicate the most conservative null boundary across all β conditions (max/min of per-condition percentiles from 200 permutations). (a) β_s produces strong A_norm response and negative B_norm. (b/c) β_h and β_r do not inflate A_norm. The B_norm curve for β_h is noisier than for β_r b…
Figure 9 validates Δ_shared by sweeping the outcome mixing parameter α_s from 0 (outcome driven entirely by modality-specific signal) to 1 (outcome driven entirely by shared signal), with α_r = 1 − α_s. Cross-validated linear probes are trained on the ground-truth latents (z_s and z_r) from Cohort A′, in a clean two-signal setting (β_s = β_r = 1, β_h = β_b = 0). As α_s increases, Δ_shared transitions smoothly from negative (mod…
Figure 10: D_task quantile and P_transfer validation on ground-truth latents. S2B trajectories through the metric space under three parameter sweeps. Quadrant boundaries are the most conservative permutation null thresholds (max across conditions, 200 permutations per condition). S1 (olive) and S4h (brown) remain in the stable-transfer region. S2B (purple) enters the unstable-transfer region as cohort-shift parameters…
Figure 11a shows that S1 and S2B produce overlapping but distinguishable distributions, with S2B exhibiting a heavier right tail past the null threshold. Figure 11b confirms that the S1 false-positive rate remains near the expected 5% across all evaluation sample sizes while S2B detection rises with N_eval, demonstrating that the permutation null is well-calibrated. Figure 11c shows per-model fire rates: S1 false p…
Figure 12: D_task quantile: S1 vs. proxy S2 on learned representations. Same format as …
Figure 13 shows P_transfer detection sensitivity for proxy S2. At N_eval = 1000, P_transfer fires for 61% of proxy S2 runs pooled across all configurations, compared to ≈5% for S3 (Figure 13a). Detectability varies substantially by proxy configuration (Figure 13b): strong aligned proxy (γ_h = 1.0, η = 0) reaches 87–88%, while weak misaligned proxy (γ_h = 0.3, η = 0.3) reaches only 37%. Runs where P_transfer does not fire…
Figure 14 shows A*_norm saturation curves from Pre Step A across all four measurement regimes and latent dimensionalities (k ∈ {5, 10, 50, 100}). Most models saturate by N_train = 30k (black dashed line). The modality-dominant regime (β_h = β_r = 2.0) shows the slowest saturation due to weaker shared signal relative to modality-specific signal, and some models have not fully plateaued at 30k in this regime. We select …
Figure 15: Detection rate versus predictive task signal, shared-dominant regime (β_s = 2.0, β_b = 0.75). Same panel layout as …
Figure 16: Detection rate versus predictive task signal, batch-dominant regime (β_b = 1.5). Same panel layout as …
Figure 17: Detection rate versus predictive task signal, modality-dominant regime (β_h = β_r = 2.0, β_b = 0.75). Same panel layout as …
Figure 18: Detection rate versus evaluation sample size per model. Same panel layout as …
Figure 19: False shared claim rate (FSCR), shared-dominant regime (β_s = 2.0, β_b = 0.75). Same panel layout as …
Figure 20: False shared claim rate (FSCR), batch-dominant regime (β_b = 1.5). Same panel layout as …
Figure 21: False shared claim rate (FSCR), modality-dominant regime (β_h = β_r = 2.0, β_b = 0.75). Same panel layout as …
Figure 22 evaluates DECAT on representations learned from proxy-entangled data. S1 detection is largely preserved relative to clean data (Figure 22a), with most models showing drops of less than 5%. Conservative S1 accuracy remains above 90% for all models (Figure 22d). Proxy S2 strict detection remains low at 5–20% (Figure 22b), consistent with the geometric challenge of detecting instability along proxy-contamina…
Figure 23: S1 detection on proxy-contaminated representations, stratified by proxy condition. Columns: proxy conditions varying in strength (γ) and alignment (η). Same format as …
Figure 24: Proxy S2 detection stratified by proxy condition. Columns: proxy conditions. Detection rates remain low (5–35%) across all conditions, with stronger proxy (γ = 1.0) producing the highest detection rates.
Figure 25: Cross-modality resolution accuracy versus N_eval. Probability that both modalities are correctly classified simultaneously. (a) S4h (H&E=S4, RNA=S3). (b) S4r (H&E=S3, RNA=S4). Factorized models achieve meaningful resolution; entangled models cannot resolve different scenarios across modalities.
Figure 26: … (Figure 26b/f), H&E is correctly classified as S4 by factorized models, but the RNA-specific signal is too weak for reliable localization, producing mostly S3 or indeterminate classifications for RNA. Mixed C (α_s + α_h with proxy γ_h > 0, η = 0.3, ground truth: RNA=S1, H&E=S2; Figure 26c/g) tests whether DECAT can distinguish shared biology from proxy-driven signal across modalities. Mixed D (α_s + α_b, ground truth: b…
Figure 27: Indeterminate sensitivity on transition configurations. (a) Transition A (biological ambiguity, no proxy or confounding). (b) Transition B (confounded ambiguity, with proxy and confounding overlaid). Factorized models increasingly return indeterminate with more data (correct behavior on non-identifiable tasks) while entangled models become more overconfident.
Figure 28: TCGA detection rates by variate index. Top row: strict accuracy (correct scenario assigned). Bottom row: conservative accuracy (correct or indeterminate). All continuous labels are binarized at the pan-cancer median. 95% Clopper–Pearson confidence intervals. (a/e) S1: CCA canonical variates (k=0–49), pooled random splits. Variate 0 carries the strongest shared signal (maximal cross-modal correlation) and …
Figure 29: False shared claim rate (FSCR) on TCGA pooled random splits. (a) FSCR by variate index for S4r tasks (RNA-modality residual PCs). (b) FSCR decomposed by ground-truth scenario (S2B, S3, S4r). Entangled models (CCA, CLIP) systematically misclassify non-shared signal as shared biology. Factorized models (JIVE, DSSL at β≥1.0) eliminate false S4r claims.
Figure 30: Power curve for S2 detection (C drawn entirely from extreme pool E, α=1). S2 detection rate versus evaluation cohort size N under two designs: (a) vary-C (A′/B held at full size, only N_C varied) and (b) equal-N (all cohorts varied together, A′ = B = C = N). The gap between panels shows the value of large reference cohorts for probe quality. For example, under vary-C, TMB reaches 80% detection at N_C ≈ 50, wherea…
Original abstract

Multimodal models in oncology can produce accurate predictions, but accurate prediction does not reveal whether the model has learned biology that is shared across modalities, biology confined to one modality, or spurious correlations that reflect confounders rather than genuine biology. We introduce DECAT, a model-agnostic post-hoc evaluation framework that classifies multimodal representations into four diagnostic scenarios for a given task and modality, using five null-referenced metrics and a rule-based decision procedure. The framework operates on learned representations, requires no knowledge of which specific confounder is present, and returns indeterminate when the evidence is insufficient. We validate DECAT on synthetic data across four multimodal model classes (over 2,500 trained representations) and on real data from 8,979 TCGA patients, evaluating both multimodal embeddings and five pretrained pathology foundation models. Entangled models (e.g., CLIP) achieve near-perfect shared biology detection but falsely claim shared biology in the majority of cases where it is absent on real foundation model embeddings. This false claim rate increases with confound strength so that larger cohorts and stronger representations produce more confident but still incorrect diagnoses. Applied to both multimodal TCGA embeddings and five pathology foundation models without paired RNA, DECAT detects confounding invisible to AUROC without requiring the confounder labels, as confirmed by post-hoc stratification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces DECAT, a model-agnostic post-hoc framework that classifies multimodal representations into four diagnostic scenarios (covering shared biology, modality-specific biology, and confounder-driven spurious correlation) and returns indeterminate when the evidence is insufficient, using five null-referenced metrics and a rule-based decision procedure. It reports validation across more than 2,500 synthetic representations from four multimodal model classes and application to embeddings from 8,979 TCGA patients plus five pathology foundation models. The central empirical claim is that entangled models (e.g., CLIP) achieve near-perfect shared-biology detection on synthetic data but exhibit high false-positive rates on real embeddings, with the false-claim rate increasing with confound strength.

Significance. If the five null-referenced metrics and rule-based procedure can be shown to reliably separate the four scenarios on real patient embeddings whose confounding structure is unknown and more complex than the controlled synthetic cases, the framework would provide a useful post-hoc diagnostic for distinguishing biologically supported multimodal predictions from spurious ones in oncology, beyond standard metrics such as AUROC.

major comments (3)
  1. [Abstract] Abstract and methods (as referenced in the reader's note): the central claim that DECAT reliably maps representations to the four scenarios on real TCGA embeddings rests on the assumption that the null-referenced metrics plus rule-based procedure generalize from synthetic data (known confounder structure) to real data (unknown, multi-variable clinical confounding). No direct accuracy measurement against ground-truth scenario labels is provided when the confounder is withheld, leaving the reported false-claim rates for entangled models on real foundation-model embeddings without independent confirmation.
  2. [Abstract] Abstract: the statement that 'this false claim rate increases with confound strength' on real data requires a concrete operationalization of confound strength that does not rely on the same metrics used for classification; without it, the reported increase could be circular with the decision procedure itself.
  3. [Abstract] Abstract: the framework is described as returning 'indeterminate when the evidence is insufficient,' yet the abstract supplies no quantitative thresholds or decision rules for the five metrics, making it impossible to assess whether the procedure is parameter-free or whether post-hoc choices affect the reported false-claim rates.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence definition or example of each of the four diagnostic scenarios to orient readers before the empirical claims.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for these constructive comments on the manuscript. We respond to each major comment below, indicating where revisions will be incorporated.

Point-by-point responses
  1. Referee: [Abstract] Abstract and methods (as referenced in the reader's note): the central claim that DECAT reliably maps representations to the four scenarios on real TCGA embeddings rests on the assumption that the null-referenced metrics plus rule-based procedure generalize from synthetic data (known confounder structure) to real data (unknown, multi-variable clinical confounding). No direct accuracy measurement against ground-truth scenario labels is provided when the confounder is withheld, leaving the reported false-claim rates for entangled models on real foundation-model embeddings without independent confirmation.

    Authors: We agree that ground-truth scenario labels cannot be obtained for real TCGA embeddings, as the true multi-variable confounding structure is unknown by design. The synthetic experiments (with held-out confounder structure) serve to validate that the five metrics and rule-based procedure recover the correct scenario when the generative process is known. On real data the framework is applied diagnostically, with post-hoc stratification by clinical variables providing corroborating evidence. We will add an explicit limitations paragraph in the Discussion clarifying this point and the reliance on synthetic validation for procedural soundness. revision: yes

  2. Referee: [Abstract] Abstract: the statement that 'this false claim rate increases with confound strength' on real data requires a concrete operationalization of confound strength that does not rely on the same metrics used for classification; without it, the reported increase could be circular with the decision procedure itself.

    Authors: Confound strength on real data is operationalized via two external proxies that are independent of the five DECAT metrics: (1) cohort size (larger TCGA subsets) and (2) representation strength (model scale and pre-training data volume of the five pathology foundation models). The abstract already alludes to this via the clause on larger cohorts and stronger representations. We will add a dedicated paragraph in Methods defining these proxies and include a supplementary table showing the monotonic relationship between these proxies and the observed false-claim rate. revision: yes

  3. Referee: [Abstract] Abstract: the framework is described as returning 'indeterminate when the evidence is insufficient,' yet the abstract supplies no quantitative thresholds or decision rules for the five metrics, making it impossible to assess whether the procedure is parameter-free or whether post-hoc choices affect the reported false-claim rates.

    Authors: The abstract is a high-level summary; the quantitative thresholds, null-referenced metric definitions, and the complete rule-based decision tree (including the indeterminate condition) are fully specified in the Methods section. No post-hoc parameter tuning was performed; the rules were fixed prior to the real-data experiments. We will add a sentence in the abstract directing readers to the Methods for the decision procedure if space allows. revision: partial
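
The monotonic relationship promised in the second response could be checked with a rank correlation between a proxy and the observed false-claim rate. The sketch below is a minimal Spearman implementation using only the standard library. The cohort sizes and rates are invented placeholders, not results from the paper, and the function names are assumptions for this example.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder values for illustration only.
cohort_sizes = [100, 300, 1000, 3000]
false_claim_rates = [0.10, 0.20, 0.35, 0.60]
rho = spearman(cohort_sizes, false_claim_rates)
```

A coefficient near 1 would support the monotonic claim, though it would not by itself show that cohort size or representation strength measures confounding, which is the referee's underlying concern.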

standing simulated objections not resolved
  • Direct ground-truth scenario labels for real TCGA embeddings cannot be supplied because the true confounding structure is unknown and multi-variable; this is an inherent limitation of any diagnostic applied to observational clinical data.

Circularity Check

0 steps flagged

No significant circularity detected in DECAT framework or claims

full rationale

The paper introduces DECAT as a new model-agnostic post-hoc framework that applies five null-referenced metrics and a rule-based decision procedure to classify representations into four diagnostic scenarios. The abstract and provided text describe validation on over 2,500 synthetic representations with controlled confounder structure plus application to real TCGA embeddings from 8,979 patients and five foundation models. No equations, decision rules, or claims are shown to reduce by construction to fitted inputs on the same data, self-definitions, or load-bearing self-citations. The central results (near-perfect detection on entangled models, false claims on real embeddings, detection of confounding invisible to AUROC) are presented as empirical outcomes of the independent framework rather than tautological renamings or forced predictions. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that null models and the chosen metrics can isolate shared-biology signals without explicit confounder labels; the decision thresholds and metric definitions are not detailed in the abstract and may function as implicit free parameters.

free parameters (1)
  • decision thresholds
    Rule-based classification requires thresholds on the five metrics whose values are not specified in the abstract and may have been selected or tuned during development.
axioms (1)
  • domain assumption: Null-referenced metrics can separate shared biology, modality-specific biology, and confounding without knowledge of the specific confounder identity
    The abstract states the framework 'requires no knowledge of which specific confounder is present' and returns 'indeterminate when the evidence is insufficient'.
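
One way to probe whether decision thresholds act as free parameters is a sensitivity sweep: recompute how often a metric fires while the cutoff varies, and see whether the headline rate is stable. The sketch below assumes a single metric that fires when it exceeds a cutoff. The metric values, cutoffs, and function names are hypothetical and do not correspond to DECAT's actual rules.

```python
def flag_rate(values, threshold):
    """Fraction of metric values strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)

def threshold_sweep(values, thresholds):
    """Map each candidate threshold to the resulting flag rate."""
    return {t: flag_rate(values, t) for t in thresholds}

# Hypothetical metric values from four evaluation splits.
metric_values = [0.1, 0.4, 0.6, 0.9]
sweep = threshold_sweep(metric_values, [0.0, 0.5, 1.0])
```

If the flag rate changes sharply across plausible cutoffs, any reported false-claim rate depends on how the cutoff was chosen, which is the concern the ledger raises.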

pith-pipeline@v0.9.1-grok · 5767 in / 1510 out tokens · 26413 ms · 2026-06-28T22:48:24.118745+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

    Chen, Ming Y

    Richard J. Chen, Ming Y . Lu, Drew F. K. Williamson, Tiffany Y . Chen, Jana Lipkova, Zahra Noor, Muhammad Shaban, Maha Shady, Mane Williams, Bumjin Joo, and Faisal Mahmood. Pan-cancer integrative histology-genomic analysis via multimodal deep learning.Cancer Cell, 40(8):865–878.e6, August 2022. ISSN 1535-6108. doi: 10.1016/j.ccell.2022.07.004

  2. [2]

    Song, Tong Ding, Sophia J

    Anurag Vaidya, Andrew Zhang, Guillaume Jaume, Andrew H. Song, Tong Ding, Sophia J. Wagner, Ming Y . Lu, Paul Doucet, Harry Robertson, Cristina Almagro-Perez, Richard J. Chen, Dina ElHarouni, Georges Ayoub, Connor Bossi, Keith L. Ligon, Georg Gerber, Long Phi Le, and Faisal Mahmood. Molecular-driven Foundation Model for Oncologic Pathology, January

  3. [3]

    EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment, December 2025

    Juseung Yun, Sunwoo Yu, Sumin Ha, Jonghyun Kim, Janghyeon Lee, Jongseong Jang, and Soonyoung Lee. EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment, December 2025. arXiv:2512.14019

  4. [4]

    A multimodal knowledge-enhanced whole-slide pathology foundation model.Nature Communications, 16(1): 11406, December 2025

    Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Cheng Jin, Shu Yang, Jinbang Li, Zhengyu Zhang, Chenglong Zhao, Huajun Zhou, Zhenhui Li, Huangjing Lin, Xin Wang, Jiguang Wang, Anjia Han, Ronald Cheong Kin Chan, Li Liang, Xiuming Zhang, and Hao Chen. A multimodal knowledge-enhanced whole-slide pathology foundation model.Nature Communications, 16(1): 11406,...

  5. [5]

    Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data

    Tianyu Liu, Tinglin Huang, Tong Ding, Hao Wu, Peter Humphrey, Sudhir Perincheri, Kurt Schalper, Rex Ying, Hua Xu, James Zou, Faisal Mahmood, and Hongyu Zhao. Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data. Nature Biomedical Engineering, pages 1–18, February 2026. ISSN 2157-846X. doi: 10.1038/ s41551-025-01602-6

  6. [6]

    Howard, James Dolezal, Sara Kochanny, Jefree Schulte, Heather Chen, Lara Heij, Dezheng Huo, Rita Nanda, Olufunmilayo I

    Frederick M. Howard, James Dolezal, Sara Kochanny, Jefree Schulte, Heather Chen, Lara Heij, Dezheng Huo, Rita Nanda, Olufunmilayo I. Olopade, Jakob N. Kather, Nicole Cipriani, Robert L. Grossman, and Alexander T. Pearson. The impact of site-specific digital histology signatures on deep learning model accuracy and bias.Nature Communications, 12(1):4423, Ju...

  7. [7]

    Do Histopathologi- cal Foundation Models Eliminate Batch Effects? A Comparative Study

    Jonah Kömen, Hannah Marienwald, Jonas Dippel, and Julius Hense. Do Histopathologi- cal Foundation Models Eliminate Batch Effects? A Comparative Study. InAIM-FM Work- shop, Advances in Neural Information Processing Systems (NeurIPS). arXiv, November 2024. arXiv:2411.05489

  8. [8]

    de Jong, Eric Marcus, and Jonas Teuwen

    Edwin D. de Jong, Eric Marcus, and Jonas Teuwen. Current Pathology Foundation Models are unrobust to Medical Center Differences, February 2025. arXiv:2501.18055

  9. [9]

    de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, and Klaus-Robert Müller

    Jonah Kömen, Edwin D. de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, and Klaus-Robert Müller. Towards Robust Foundation Models for Digital Pathology, July 2025. arXiv:2507.17845

  10. [10]

    Gustafsson, Kajsa Ledesma Eriksson, and Mattias Rantalainen

    Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, and Mattias Rantalainen. Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models, January 2026. arXiv:2601.04163. 10

  11. [11]

    Confounding factors and biases abound when predicting molecular biomarkers from histological images.Nature Biomedical Engineering, pages 1–15, March 2026

    Muhammad Dawood, Kim Branson, Sabine Tejpar, Nasir Rajpoot, and Fayyaz ul Amir Afsar Minhas. Confounding factors and biases abound when predicting molecular biomarkers from histological images.Nature Biomedical Engineering, pages 1–15, March 2026. ISSN 2157- 846X. doi: 10.1038/s41551-026-01616-8

  12. [12]

    Theis, and Bo Wang

    Haotian Cui, Alejandro Tejada-Lapuerta, Maria Brbi´c, Julio Saez-Rodriguez, Simona Cristea, Hani Goodarzi, Mohammad Lotfollahi, Fabian J. Theis, and Bo Wang. Towards multimodal foundation models in molecular cell biology.Nature, 640(8059):623–633, April 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-08710-y

  13. [13]

    Vanguri, Jia Luo, Andrew T

    Rami S. Vanguri, Jia Luo, Andrew T. Aukerman, Jacklynn V . Egger, Christopher J. Fong, Natally Horvat, Andrew Pagano, Jose de Arimateia Batista Araujo-Filho, Luke Geneslaw, Hira Rizvi, Ramon Sosa, Kevin M. Boehm, Soo-Ryum Yang, Francis M. Bodd, Katia Ventura, Travis J. Hollmann, Michelle S. Ginsberg, Jianjiong Gao, Rami Vanguri, Matthew D. Hellmann, Jenni...

  14. [14]

    A multimodal AI biomarker PATH-ORACLE improves prediction of recurrence in stage I lung adenocarcinoma, February 2026

    Oz Kilim, Orsolya Pipek, Zsofia Sztupinszki, Miklos Diossy, Aurel Prosz, Cristina Naceur- Lombardelli, Selvaraju Veeriah, David Moore, Mariam Jamal-Hanjani, Allan Hackshaw, Janos Fillinger, Judit Moldvay, Istvan Csabai, Charles Swanton, and Zoltan Szallasi. A multimodal AI biomarker PATH-ORACLE improves prediction of recurrence in stage I lung adenocarcin...

  15. [15]

    Howard, Jakob Nikolas Kather, and Alexander T

    Frederick M. Howard, Jakob Nikolas Kather, and Alexander T. Pearson. Multimodal deep learning: An improvement in prognostication or a reflection of batch effect?Cancer Cell, 41 (1):5–6, January 2023. ISSN 1535-6108, 1878-3686. doi: 10.1016/j.ccell.2022.10.025

  16. [16]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. InPro- ceedings of the 38th International Conference on Machine Learning (ICML). arXiv, February

  17. [17]

    An Information Criterion for Controlled Disentanglement of Multimodal Data

    Chenyu Wang, Sharut Gupta, Xinyi Zhang, Sana Tonekaboni, Stefanie Jegelka, Tommi Jaakkola, and Caroline Uhler. An Information Criterion for Controlled Disentanglement of Multimodal Data. In Proceedings of the International Conference on Learning Representations. arXiv, March 2025. arXiv:2410.23996

  18. [18]

    Factorized Contrastive Learning: Going Beyond Multi-view Redundancy

    Paul Pu Liang, Zihao Deng, Martin Ma, James Zou, Louis-Philippe Morency, and Ruslan Salakhutdinov. Factorized Contrastive Learning: Going Beyond Multi-view Redundancy. In Advances in Neural Information Processing Systems (NeurIPS). arXiv, October 2023. arXiv:2306.05268

  19. [19]

    Canonical Correlation Analysis (CCA) Based Multi-View Learning: An Overview, May 2021

    Chenfeng Guo and Dongrui Wu. Canonical Correlation Analysis (CCA) Based Multi-View Learning: An Overview, May 2021. arXiv:1907.01693

  20. [20]

    Joint and individual variation explained (JIVE) for integrated analysis of multiple data types

    Eric F. Lock, Katherine A. Hoadley, J. S. Marron, and Andrew B. Nobel. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1), March 2013. ISSN 1932-6157. doi: 10.1214/12-AOAS597

  21. [21]

    Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F. K. Williamson, Harry Robertson, Bowen Chen, Cristina Almagro-Pérez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Christina S. Chen, Daisuke Komura, Akihiro Kawabe, Mieko Ochi, Shinya Sato, Tomoyuki...

  22. [22]

    Pretrained transformers applied to clinical studies improve predictions of treatment efficacy and associated biomarkers

    Gustavo Arango-Argoty, Elly Kipkogei, Ross Stewart, Gerald J. Sun, Arijit Patra, Ioannis Kagiampakis, and Etai Jacob. Pretrained transformers applied to clinical studies improve predictions of treatment efficacy and associated biomarkers. Nature Communications, 16(1):2101, March 2025. ISSN 2041-1723. doi: 10.1038/s41467-025-57181-2

  23. [23]

    A visual-language foundation model for computational pathology

    Ming Y. Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, Anil V. Parwani, Andrew Zhang, and Faisal Mahmood. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, March 2024. ISSN 1546-170X. doi: 10.1038/s41591-024-02856-4

  24. [24]

    Richard J. Chen, Tong Ding, Ming Y. Lu, Drew F. K. Williamson, Guillaume Jaume, Andrew H. Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, Mane Williams, Lukas Oldenburg, Luca L. Weishaupt, Judy J. Wang, Anurag Vaidya, Long Phi Le, Georg Gerber, Sharifa Sahai, Walt Williams, and Faisal Mahmood. Towards a general-purpose foundation model for ...

  25. [25]

    H-optimus-0, 2024

    Charlie Saillard, Rodolphe Jenatton, Felipe Llinares-López, Zelda Mariet, David Cahané, Eric Durand, and Jean-Philippe Vert. H-optimus-0, 2024. https://github.com/bioptimus/releases/tree/main/models/h-optimus/v0

  26. [26]

    Daniel Kaplan, Ratna Sagari Grandhi, Connor Lane, Benjamin Warner, Tanishq Mathew Abraham, and Paul S. Scotti. How to Train a State-of-the-Art Pathology Foundation Model with $1.6k, 2025. Sophont Blog, https://sophont.med/blog/openmidnight

  27. [27]

    Training state-of-the-art pathology foundation models with orders of magnitude less data, April 2025

    Mikhail Karasikov, Joost van Doorn, Nicolas Känzig, Melis Erdal Cesur, Hugo Mark Horlings, Robert Berke, Fei Tang, and Sebastian Otálora. Training state-of-the-art pathology foundation models with orders of magnitude less data, April 2025. arXiv:2504.05186

  28. [28]

    Transcriptomics-guided Slide Representation Learning in Computational Pathology

    Guillaume Jaume, Lukas Oldenburg, Anurag Vaidya, Richard J. Chen, Drew F. K. Williamson, Thomas Peeters, Andrew H. Song, and Faisal Mahmood. Transcriptomics-guided Slide Representation Learning in Computational Pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv, May 2024. arXiv:2405.11618

  29. [29]

    A visual–omics foundation model to bridge histopathology with spatial transcriptomics

    Weiqing Chen, Pengzhi Zhang, Tu N. Tran, Yiwei Xiao, Shengyu Li, Vrutant V. Shah, Hao Cheng, Kristopher W. Brannan, Keith Youker, Li Lai, Longhou Fang, Yu Yang, Nhat-Tu Le, Jun-ichi Abe, Shu-Hsia Chen, Qin Ma, Ken Chen, Qianqian Song, John P. Cooke, and Guangyu Wang. A visual–omics foundation model to bridge histopathology with spatial transcriptomics. Na...

  30. [30]

    The Conditional Entropy Bottleneck

    Ian Fischer. The Conditional Entropy Bottleneck. Entropy, 22(9):999, September 2020. ISSN 1099-4300. doi: 10.3390/e22090999

  31. [31]

    Disentanglement of Variations with Multimodal Generative Modeling, September 2025

    Yijie Zhang, Yiyang Shen, and Weiran Wang. Disentanglement of Variations with Multimodal Generative Modeling, September 2025. arXiv:2509.23548

  32. [32]

    IndiSeek learns information-guided disentangled representations, December 2025

    Yu Gui, Cong Ma, and Zongming Ma. IndiSeek learns information-guided disentangled representations, December 2025. arXiv:2509.21584

  33. [33]

    Learning Optimal Multimodal Information Bottleneck Representations

    Qilong Wu, Yiyang Shao, Jun Wang, and Xiaobo Sun. Learning Optimal Multimodal Information Bottleneck Representations. In Proceedings of the 42nd International Conference on Machine Learning. arXiv, May 2025. arXiv:2505.19996

  34. [34]

    Xinyi Zhang, G. V. Shivashankar, and Caroline Uhler. Partially shared multi-modal embedding learns holistic representation of cell state. Nature Computational Science, 6(3):285–300, March. ISSN 2662-8457. doi: 10.1038/s43588-025-00948-w

  36. [36]

    Beyond alignment: Synergistic integration is required for multimodal cell foundation models

    Till Richter, Eric Zimmermann, James Hall, Fabian J. Theis, Srivatsan Raghavan, Peter S. Winter, Ava P. Amini, and Lorin Crawford. Beyond alignment: Synergistic integration is required for multimodal cell foundation models, March 2026. bioRxiv preprint, doi:10.64898/2026.02.23.707420.

  37. [37]

    Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

    Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In Proceedings of the 36th International Conference on Machine Learning (ICML). arXiv, June 2019. arXiv:1811.12359

  38. [38]

    Causal Structure and Representation Learning with Biomedical Applications, November 2025

    Caroline Uhler and Jiaqi Zhang. Causal Structure and Representation Learning with Biomedical Applications, November 2025. arXiv:2511.04790.

Appendix

A Additional Simulator Limitations

The measurement model is linear (Appendix D.2), while real biological assays involve highly nonlinear transformations. This likely advantages linear models (CCA, JIVE) be...

  39. [39]

    Observations are linear mixtures of latents plus Gaussian noise. Real assays involve nonlinear mixing, which the linear simulator does not capture.

  40. [40]

    Latents are independent Gaussians. Real biology has complex dependencies, and the imposed independence is an optimistic assumption for $z_s$/$b$ separation.

  41. [41]

    Measurement matrices $A^*$, $B^*$ are fixed across cohorts. Real cohorts differ in scanner, protocol, and pre-processing, and we do not model measurement-level shifts.

  42. [42]

    Confounding is shared across modalities via a single latent $b$. Modality-specific artifacts uncorrelated with the shared batch axis are not modeled as a separate latent, though the single-modality proxy ($\gamma_h > 0$, $\gamma_r = 0$) captures a related mechanism. Such artifacts do not generate shared-looking signal and are therefore not the hardest failure mode for S1 mi...

  43. [43]

    Outcome weight vectors $\bar{w}^*$ are fixed across patients and dense (every latent dimension contributes). Real biomarkers may be sparse, with phenotype depending on a low-dimensional mechanism embedded in a higher-dimensional latent space.

  44. [44]

    Modalities are treated as symmetric (similar SNR and contamination). In practice, one modality may capture shared biology more reliably than the other

  45. [45]

    All outcome-relevant structure is encoded in $(z_s, z_h, z_r, b)$ with no unmodeled confounders. In real data, additional unmeasured factors may influence both the outcome and the observed features.

  46. [46]

    All modalities observe the same patient’s latent state with perfect spatial and temporal correspondence. In real data, different assays may sample different tissue regions or time points, introducing intratumor heterogeneity and sampling variation that the simulator does not model.

E Model Architectures and Training

We evaluate four model classes spanning...

  47. [47]

    Joint subspace: compute the SVD of the cross-covariance $C = X_h^\top X_r / (n-1)$, with the top $K$ left and right singular vectors $W_h$, $W_r$ defining the joint projection directions. This step is identical to CCA, so JIVE’s joint component matches CCA’s shared representation exactly.

  48. [48]

    Individual subspaces: project out the joint subspace from each modality ($X_h^{\mathrm{res}} = X_h - X_h W_h W_h^\top$, analogously for $r$), then apply PCA to each residual to obtain $K$ individual components per modality. For evaluation cohorts, the encoder returns the concatenation $[Z^{(h)}_{\mathrm{joint}} \mid Z^{(h)}_{\mathrm{indiv}}] \in \mathbb{R}^{n \times 2K}$. The first $K$ dimensions are joint (shared) and the remaining $K$ ...
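The two steps above (joint subspace from the cross-covariance SVD, then PCA on the residual) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names `fit_joint_individual` and `encode` are hypothetical, and mean-centering with training-cohort means is an assumption that the text does not state explicitly.

```python
import numpy as np

def fit_joint_individual(Xh, Xr, K):
    """Fit joint and individual projections on a training cohort (hypothetical helper)."""
    # Assumption: features are centered with training-cohort means.
    mu_h, mu_r = Xh.mean(axis=0), Xr.mean(axis=0)
    Xh_c, Xr_c = Xh - mu_h, Xr - mu_r
    n = Xh_c.shape[0]

    # Joint subspace: top-K singular vectors of C = Xh^T Xr / (n - 1).
    C = Xh_c.T @ Xr_c / (n - 1)
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    Wh, Wr = U[:, :K], Vt[:K].T

    def individual(Xc, W):
        # Project out the joint directions, then PCA (via SVD) on the residual.
        resid = Xc - Xc @ W @ W.T
        _, _, Vt_res = np.linalg.svd(resid, full_matrices=False)
        return Vt_res[:K].T

    return {
        "h": (mu_h, Wh, individual(Xh_c, Wh)),
        "r": (mu_r, Wr, individual(Xr_c, Wr)),
    }

def encode(X, params):
    """Return [Z_joint | Z_indiv] with shape (n, 2K) for one modality."""
    mu, W, P = params
    Xc = X - mu
    joint = Xc @ W
    indiv = (Xc - Xc @ W @ W.T) @ P
    return np.hstack([joint, indiv])
```

Calling `encode(X_eval, params["h"])` on an evaluation cohort yields an array whose first `K` columns are the joint scores and whose last `K` columns are the individual scores, matching the concatenation described above.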

  49. [49]

    Probe direction $\hat{w}$. From the logistic regression coefficients $W_{\mathrm{LR}}$ of the selected probe and per-dimension standard deviations $\hat{\sigma}$ estimated on $A'$:

    $$\hat{w} = \frac{W_{\mathrm{LR}} / \hat{\sigma}}{\lVert W_{\mathrm{LR}} / \hat{\sigma} \rVert_2}. \tag{23}$$

    Selection follows Stage III: shared-dominant leads to using the probe on $z_c$; modality-dominant leads to using the probe on $z_{ms}$; indeterminate leads to using the probe on $z_c$ (conservative fallback); ...

  50. [50]

    Scores. Project representations onto $\hat{w}$ without per-cohort standardization: $s^{(X)}_i = \langle z^{(i,s)}_{\mathrm{sig}}, \hat{w} \rangle$ for $X \in \{A', B, C\}$. Each patient receives a scalar representing their position along the outcome-predictive direction.

    3. $A'$-referenced quantiles.

    $$q^{(X)}_i = F_{A'}\big(s^{(X)}_i\big) = \frac{\big|\{j \in A' : s^{(A')}_j \le s^{(X)}_i\}\big|}{|A'|}, \quad X \in \{B, C\}. \tag{24}$$

    $A'$ provides a fixed reference sca...

  51. [51]

    Wasserstein-1 distance. $D^{\mathrm{task}(s)}_{\mathrm{quantile}} = W_1(Q_B, Q_C)$, where $Q_X = \{q^{(X)}_i\}_{i \in X}$. A large $W_1$ indicates that the cohort shift moves patients differentially along the probe direction, suggesting composition-dependent ordering.

    Null calibration. A one-sided permutation null is estimated by shuffling B/C cohort labels (preserving sizes); the probe direction ˆ...
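The four steps above (probe direction, scores, $A'$-referenced quantiles, Wasserstein-1 distance with a permutation null) can be sketched as below. This is a hedged illustration under stated assumptions: all function names are hypothetical, the upper null quantile `upper_q=0.95` and the number of permutations are placeholders not given in the text, and the probe direction is assumed to stay fixed under the null because the source sentence describing this is cut off.

```python
import numpy as np

def probe_direction(w_lr, sigma):
    """Eq. (23): scale probe coefficients by 1/sigma, then L2-normalize."""
    v = np.asarray(w_lr, dtype=float) / np.asarray(sigma, dtype=float)
    return v / np.linalg.norm(v)

def reference_quantiles(s_ref, s):
    """Eq. (24): fraction of reference scores <= each score (empirical CDF of A')."""
    ref = np.sort(np.asarray(s_ref, dtype=float))
    return np.searchsorted(ref, s, side="right") / ref.size

def wasserstein_1d(u, v):
    """W1 between two 1-D empirical distributions (integral of |CDF_u - CDF_v|)."""
    u, v = np.sort(u), np.sort(v)
    grid = np.sort(np.concatenate([u, v]))
    cdf_u = np.searchsorted(u, grid[:-1], side="right") / u.size
    cdf_v = np.searchsorted(v, grid[:-1], side="right") / v.size
    return float(np.sum(np.abs(cdf_u - cdf_v) * np.diff(grid)))

def d_quantile_with_null(Z_ref, Z_b, Z_c, w_hat, n_perm=1000, upper_q=0.95, seed=0):
    """Observed W1(Q_B, Q_C) and an upper quantile of its label-permutation null."""
    s_ref = Z_ref @ w_hat
    q_b = reference_quantiles(s_ref, Z_b @ w_hat)
    q_c = reference_quantiles(s_ref, Z_c @ w_hat)
    observed = wasserstein_1d(q_b, q_c)

    # Assumption: w_hat and the A' reference are held fixed, so shuffling
    # B/C labels is equivalent to permuting the pooled quantiles.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([q_b, q_c])
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(pooled)
        null[k] = wasserstein_1d(perm[: q_b.size], perm[q_b.size:])
    return observed, float(np.quantile(null, upper_q))
```

Comparing `observed` against the returned null quantile corresponds to the test of $D^{\mathrm{task}(s)}_{\mathrm{quantile}}$ against its upper null bound; the one-sided design reflects that only unusually large distances are of interest.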

  52. [52]

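    1. Scenario 2 (checked first). If $\mathrm{CI}_{\mathrm{lower}}(P^{(s)}_{\mathrm{transfer}}) > \mathrm{null}_{\mathrm{upper}}$ and $D^{\mathrm{task}(s)}_{\mathrm{quantile}} \ge \mathrm{null}_{\mathrm{upper}}$, predictive signal transfers functionally but biological ordering is unstable across cohorts. Assign S2.
    2. Scenario 1. If not S2, and $A_{\mathrm{norm}} > \mathrm{null}_{\mathrm{upper}}$, and signal-gate $p < 0.05$, and (for factorized models) shared-dominant localization, and $\mathrm{CI}_{\mathrm{lower}}(P^{(s)}_{\mathrm{transfer}}) >$ n...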

  53. [53]

    3. Scenario 4. If not S2 or S1, and signal-gate $p < 0.05$, and (for factorized models) modality-dominant localization, and $\mathrm{CI}_{\mathrm{lower}}(P^{(s)}_{\mathrm{transfer}}) > \mathrm{null}_{\mathrm{upper}}$, and $D^{\mathrm{task}(s)}_{\mathrm{quantile}} < \mathrm{null}_{\mathrm{upper}}$: Assign S4.
    4. Scenario 3. If signal-gate $p \ge 0.05$ and Scenario 2 not assigned: Assign S3.
    5. Indeterminate. Any case not matching the above: Assign ∅.

G.5 Cross-Modality Scenario Intera...
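The ordered rule above (S2 checked first, then S1, S4, S3, otherwise indeterminate) can be written as a short function. This is a sketch under assumptions, not the authors' code: the `Diagnostics` container and its field names are hypothetical, each metric is assumed to carry its own null upper bound, and the last Scenario 1 condition, which is cut off in the text, is assumed to compare the transfer lower confidence bound against its null upper bound as in Scenarios 2 and 4.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Diagnostics:
    """Hypothetical container for the quantities the rule consumes."""
    transfer_ci_lower: float
    transfer_null_upper: float
    d_quantile: float
    d_quantile_null_upper: float
    a_norm: float
    a_norm_null_upper: float
    signal_gate_p: float
    # "shared", "modality" or "indeterminate"; None for non-factorized models.
    localization: Optional[str] = None

def assign_scenario(m: Diagnostics, alpha: float = 0.05) -> str:
    transfers = m.transfer_ci_lower > m.transfer_null_upper
    ordering_shift = m.d_quantile >= m.d_quantile_null_upper
    signal = m.signal_gate_p < alpha
    factorized = m.localization is not None

    # 1. Scenario 2 is checked first.
    if transfers and ordering_shift:
        return "S2"
    # 2. Scenario 1 (final condition assumed; see lead-in).
    if (m.a_norm > m.a_norm_null_upper and signal
            and (not factorized or m.localization == "shared")
            and transfers):
        return "S1"
    # 3. Scenario 4.
    if (signal and (not factorized or m.localization == "modality")
            and transfers and not ordering_shift):
        return "S4"
    # 4. Scenario 3: the signal gate fails and S2 was not assigned.
    if not signal:
        return "S3"
    # 5. Anything else is indeterminate.
    return "indeterminate"
```

Because S2 is tested before everything else, a representation whose signal transfers but whose quantile ordering shifts is never labelled S1 or S4, regardless of the other metrics.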

  54. [54]

    Outcome-scenario evaluation (linear probes on Cohorts A′, B, C across multiple outcome configurations). This separation enables evaluating many biological scenarios on a single frozen representation without retraining, improving computational efficiency.

H.2 Representation-Generating Parameter Space

Each synthetic run is defined by a joint configuration o...