When Are Multimodal Predictions Biologically Supported? A Diagnostic Evaluation Framework

Dylan Steiner; Etai Jacob; Gerald Sun; Gustavo Arango-Argoty

arxiv: 2605.31504 · v1 · pith:5OALTXXWnew · submitted 2026-05-29 · 💻 cs.LG · stat.ML

Pith reviewed 2026-06-28 22:48 UTC · model grok-4.3

keywords: multimodal learning · confounding detection · biological interpretability · foundation models · oncology · diagnostic evaluation · representation analysis

The pith

DECAT shows entangled multimodal models falsely claim shared biology in most cases where it is absent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DECAT, a model-agnostic post-hoc framework that assigns a multimodal representation, for a given task and modality, to one of four diagnostic scenarios (among them shared biology across modalities, biology confined to one modality, and spurious correlations driven by confounders) or returns indeterminate when the evidence is insufficient. It does so with five null-referenced metrics and a rule-based decision procedure. The framework requires no knowledge of which confounder is present and operates directly on learned embeddings. It is validated on more than 2,500 synthetic representations across four model classes and on embeddings from 8,979 TCGA patients. Entangled models such as CLIP achieve near-perfect detection of shared biology, yet they falsely claim shared biology in the majority of cases where it is absent, and this false-claim rate rises with confound strength. The same pattern appears when the framework is applied to five pretrained pathology foundation models on real patient data without paired RNA, where the confounding is invisible to standard AUROC evaluation.

Core claim

DECAT classifies multimodal representations into four diagnostic scenarios using five null-referenced metrics and a rule-based procedure. On both synthetic data and real TCGA embeddings, entangled models achieve near-perfect shared-biology detection while falsely claiming shared biology in the majority of cases where it is absent. The false-claim rate increases with confound strength, so larger cohorts and stronger representations yield more confident but still incorrect diagnoses.

What carries the argument

The DECAT framework, a set of five null-referenced metrics plus a rule-based decision procedure that assigns each representation to one of four diagnostic scenarios without requiring confounder labels.

If this is right

Standard AUROC evaluation cannot distinguish genuine shared biology from confounding in multimodal oncology models.
Entangled training objectives increase the rate of false shared-biology claims as dataset size and representation strength grow.
The framework can be applied to existing foundation models without paired modalities to surface confounding that performance metrics miss.
Models labeled indeterminate by DECAT should not be interpreted as biologically supported for the given task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of multimodal foundation models could run DECAT as a routine post-training check before deploying predictions as biologically grounded.
The same metric set might be adapted to other multimodal domains such as imaging-genomics pairs outside oncology.
If the rule-based decision thresholds prove stable across cohorts, DECAT could serve as a lightweight filter for selecting representations for downstream biological interpretation.

Load-bearing premise

The five null-referenced metrics and rule-based procedure can reliably separate the four diagnostic scenarios even when the confounder is unknown and the representations come from real patient data with complex confounding.

What would settle it

A dataset in which the true presence or absence of shared biology and the identity of the confounder are known in advance, yet DECAT assigns the wrong diagnostic label to a majority of representations.

Figures

Figures reproduced from arXiv: 2605.31504 by Dylan Steiner, Etai Jacob, Gerald Sun, Gustavo Arango-Argoty.

**Figure 1.** DECAT framework overview. DECAT takes per-modality embeddings from any multimodal model and uses a four-stage decision tree to classify the predictive behavior of each modality, for a given task, into one of four scenarios or indeterminate (…

**Figure 2.** DECAT detection rate versus predictive task signal per model on synthetic ground truth. Each column varies the task coefficient for one signal source while all others are zero, producing a pure scenario (S3 has all coefficients at zero). Top row: strict accuracy (correct scenario assigned). Bottom row: conservative accuracy (correct or indeterminate). Where conservative accuracy substantially exceeds stric…
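The two accuracy notions in the Figure 2 caption can be written down directly. The following is a minimal sketch, assuming scenario assignments are stored as plain strings and that the indeterminate outcome is encoded as the string "indeterminate"; the function name and this encoding are illustrative and not taken from the paper.

```python
def strict_and_conservative_accuracy(true_labels, predicted_labels,
                                     indeterminate="indeterminate"):
    """Return (strict, conservative) accuracy.

    strict:       fraction of runs where the assigned scenario equals the truth.
    conservative: fraction where the assignment is correct OR indeterminate.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have equal length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("need at least one run")
    pairs = list(zip(true_labels, predicted_labels))
    strict = sum(t == p for t, p in pairs) / n
    conservative = sum(t == p or p == indeterminate for t, p in pairs) / n
    return strict, conservative


# Toy example with hypothetical assignments.
truth = ["S1", "S1", "S3", "S4"]
pred = ["S1", "indeterminate", "S1", "S4"]
strict, conservative = strict_and_conservative_accuracy(truth, pred)
```

A gap between the two numbers means the procedure abstained rather than committing to a wrong scenario, which is the behavior the caption draws attention to.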

**Figure 3.** False shared claim rate (FSCR) on synthetic ground truth. FSCR is the probability that DECAT assigns Scenario 1 (shared biology) when the true scenario is not S1, pooled across all non-S1 scenarios (S2B, S3, S4h, S4r). (a) FSCR versus predictive task signal at Neval = 1000, with α matched by magnitude across non-S1 scenarios. FSCR rises with signal strength for entangled models. (b) FSCR versus evaluation …
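The FSCR definition in the Figure 3 caption is a conditional frequency, and a short sketch makes the pooling explicit. The label strings below follow the caption; the function name and the toy data are assumptions made for illustration.

```python
def false_shared_claim_rate(true_labels, predicted_labels, shared="S1"):
    """Fraction of non-shared runs that were nevertheless assigned the shared scenario.

    All runs whose true scenario is not `shared` are pooled, as in the caption.
    """
    non_shared = [(t, p) for t, p in zip(true_labels, predicted_labels) if t != shared]
    if not non_shared:
        raise ValueError("FSCR is undefined without any non-shared runs")
    return sum(p == shared for _, p in non_shared) / len(non_shared)


# Toy example: four non-S1 runs, two of which are wrongly called S1.
truth = ["S1", "S2B", "S3", "S4h", "S4r"]
pred = ["S1", "S1", "S3", "S1", "indeterminate"]
fscr = false_shared_claim_rate(truth, pred)
```

Note that an indeterminate assignment does not count as a false shared claim, so a procedure that abstains more often lowers its FSCR.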

**Figure 4.** DECAT detects TCGA cancer-type confounding invisible to AUROC (CLIP, H&E representation). The α-mixture sweep varies the fraction of Cohort C drawn from a label-extreme pool while A′/B remain random pan-cancer splits; NC ≈ 300 fixed. Dotted line: mean natural cohort composition across labels (≈0.20; individually TMB ≈0.25, TP53 ≈0.10, Age ≈0.25). TMB and Age are binarized at the pan-cancer median; TP53 is …
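One way to picture the α-mixture sweep in the Figure 4 caption is as a sampling routine that fixes the cohort size and varies the share taken from the extreme pool. The sketch below is an assumption about the mechanics, not the authors' code: it treats the two pools as disjoint lists of patient indices, samples without replacement, and uses a hypothetical function name.

```python
import random


def sample_mixture_cohort(extreme_pool, background_pool, alpha, n_cohort, seed=0):
    """Draw a cohort of size n_cohort with a fraction alpha from the extreme pool.

    Assumes the pools are disjoint and each is large enough to sample
    without replacement.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_extreme = round(alpha * n_cohort)
    n_background = n_cohort - n_extreme
    if n_extreme > len(extreme_pool) or n_background > len(background_pool):
        raise ValueError("pool too small for the requested mixture")
    rng = random.Random(seed)
    return rng.sample(extreme_pool, n_extreme) + rng.sample(background_pool, n_background)


# Toy example: indices 0..499 play the role of the label-extreme pool.
extreme_pool = list(range(0, 500))
background_pool = list(range(500, 2000))
cohort = sample_mixture_cohort(extreme_pool, background_pool, alpha=0.25, n_cohort=300)
```

Sweeping `alpha` from the natural composition up to 1 while holding `n_cohort` fixed changes only the cohort makeup, so any change in the diagnostic can be attributed to composition rather than sample size.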

**Figure 5.** DECAT’s S2 flag rate predicts within-type AUROC collapse across driver genes (unimodal H&E, Stages II and IV only). (a) Within-type AUROC collapse ∆ = AUROCpan − AUROCwithin (median across cancer types) per FM (colored dots) for 16 driver genes sorted by η² (fraction of mutation prevalence variance explained by cancer type). (b) S2 flag rate versus α for KRAS (50 splits). Extreme pool E defined per FM fro…
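The collapse statistic in the Figure 5 caption, ∆ = AUROCpan − AUROCwithin with the within-type value taken as a median across cancer types, can be sketched with a rank-based AUROC. How the paper trains and evaluates the underlying probes is not stated in the caption, so the scores here are simply given; skipping cancer types that contain only one class is also an assumption.

```python
from statistics import median


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def within_type_collapse(labels, scores, cancer_types):
    """Pan-cancer AUROC minus the median AUROC computed inside each cancer type."""
    pan = auroc(labels, scores)
    within = []
    for ct in sorted(set(cancer_types)):
        idx = [i for i, c in enumerate(cancer_types) if c == ct]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUROC is undefined when a type has a single class
        within.append(auroc(y, [scores[i] for i in idx]))
    if not within:
        raise ValueError("no cancer type contains both classes")
    return pan - median(within)


# Toy data: scores mostly track cancer type, and prevalence differs by type.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.85, 0.2, 0.1, 0.3, 0.25]
types = ["A", "A", "A", "A", "B", "B", "B", "B"]
delta = within_type_collapse(labels, scores, types)
```

In this toy case the pan-cancer AUROC is 0.75 while the median within-type AUROC is 0.5, so the apparent skill comes from separating types rather than from predicting the label inside a type.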

**Figure 6.** DECAT decision procedure. Four stages are applied sequentially per task and per modality. Stage I checks structural geometry (task-independent). Stage II gates on signal presence. Stage III localizes signal to shared or modality-specific components (factorized models only). Stage IV evaluates cross-cohort stability via Ptransfer and Dtask quantile, with Scenario 2 checked first. Terminal nodes are the four…

**Figure 7.** Representation geometry under independent β sweeps. Each curve varies one β coefficient while holding others at zero; marker size increases with β value. Null thresholds (dashed lines) are the most conservative boundaries across all β conditions (max of per-condition 97.5th percentiles from 200 permutations). Only shared signal (βs) drives the representation strongly into the aligned-structure region. Moda…
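The "most conservative" null threshold described in the Figure 7 caption (the maximum, over conditions, of each condition's 97.5th percentile from 200 permutations) can be sketched as follows. What exactly the paper permutes and which metric it evaluates are not given in the caption, so the shuffled pairing and the toy covariance metric below are assumptions chosen only to make the example runnable.

```python
import random


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def abs_cov(x, y):
    """Toy metric (an assumption for illustration): absolute sample covariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x))


def permutation_null(metric, x, y, n_perm=200, seed=0):
    """Metric values after shuffling y, which breaks any real x-y association."""
    rng = random.Random(seed)
    y_perm = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        null.append(metric(x, y_perm))
    return null


def conservative_threshold(metric, conditions, q=97.5, n_perm=200, seed=0):
    """Max over conditions of each condition's q-th percentile of its permutation null."""
    return max(
        percentile(permutation_null(metric, x, y, n_perm, seed), q)
        for x, y in conditions
    )


# Two toy "conditions" with different noise scales.
rng = random.Random(1)
conditions = []
for scale in (0.5, 2.0):
    x = [rng.gauss(0, 1) for _ in range(100)]
    y = [xi + rng.gauss(0, scale) for xi in x]
    conditions.append((x, y))

threshold = conservative_threshold(abs_cov, conditions)
```

Taking the maximum across conditions means a metric must clear the hardest null to count as significant, which trades some power for a single boundary that holds in every condition.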

**Figure 8.** Per-β sensitivity of Anorm and Bnorm. Each panel sweeps one β coefficient independently while holding all others at zero. Dashed lines indicate the most conservative null boundary across all β conditions (max/min of per-condition percentiles from 200 permutations). (a) βs produces strong Anorm response and negative Bnorm. (b/c) βh and βr do not inflate Anorm. The Bnorm curve for βh is noisier than for βr b…

**Figure 9.** Figure 9 validates ∆shared by sweeping the outcome mixing parameter αs from 0 (outcome driven entirely by modality-specific signal) to 1 (outcome driven entirely by shared signal), with αr = 1 − αs. Cross-validated linear probes are trained on the ground-truth latents (zs and zr) from Cohort A′, in a clean two-signal setting (βs = βr = 1, βh = βb = 0). As αs increases, ∆shared transitions smoothly from negative (mod…

**Figure 10.** Dtask quantile and Ptransfer validation on ground-truth latents. S2B trajectories through the metric space under three parameter sweeps. Quadrant boundaries are the most conservative permutation null thresholds (max across conditions, 200 permutations per condition). S1 (olive) and S4h (brown) remain in the stable-transfer region. S2B (purple) enters the unstable-transfer region as cohort-shift parameters…

**Figure 11.** Figure 11a shows that S1 and S2B produce overlapping but distinguishable distributions, with S2B exhibiting a heavier right tail past the null threshold. Figure 11b confirms that the S1 false-positive rate remains near the expected 5% across all evaluation sample sizes while S2B detection rises with Neval, demonstrating that the permutation null is well-calibrated. Figure 11c shows per-model fire rates: S1 false p…

**Figure 12.** Dtask quantile: S1 vs. proxy S2 on learned representations. Same format as …

**Figure 13.** Figure 13 shows Ptransfer detection sensitivity for proxy S2. At Neval = 1000, Ptransfer fires for 61% of proxy S2 runs pooled across all configurations, compared to ≈5% for S3 (Figure 13a). Detectability varies substantially by proxy configuration (Figure 13b): strong aligned proxy (γh = 1.0, η = 0) reaches 87–88%, while weak misaligned proxy (γh = 0.3, η = 0.3) reaches only 37%. Runs where Ptransfer does not fire…

**Figure 14.** Figure 14 shows A∗norm saturation curves from Pre Step A across all four measurement regimes and latent dimensionalities (k ∈ {5, 10, 50, 100}). Most models saturate by Ntrain = 30k (black dashed line). The modality-dominant regime (βh = βr = 2.0) shows the slowest saturation due to weaker shared signal relative to modality-specific signal, and some models have not fully plateaued at 30k in this regime. We select …

**Figure 15.** Detection rate versus predictive task signal, shared-dominant regime (βs = 2.0, βb = 0.75). Same panel layout as …

**Figure 16.** Detection rate versus predictive task signal, batch-dominant regime (βb = 1.5). Same panel layout as …

**Figure 17.** Detection rate versus predictive task signal, modality-dominant regime (βh = βr = 2.0, βb = 0.75). Same panel layout as …

**Figure 18.** Detection rate versus evaluation sample size per model. Same panel layout as …

**Figure 19.** False shared claim rate (FSCR), shared-dominant regime (βs = 2.0, βb = 0.75). Same panel layout as …

**Figure 20.** False shared claim rate (FSCR), batch-dominant regime (βb = 1.5). Same panel layout as …

**Figure 21.** False shared claim rate (FSCR), modality-dominant regime (βh = βr = 2.0, βb = 0.75). Same panel layout as …

**Figure 22.** Figure 22 evaluates DECAT on representations learned from proxy-entangled data. S1 detection is largely preserved relative to clean data (Figure 22a), with most models showing drops of less than 5%. Conservative S1 accuracy remains above 90% for all models (Figure 22d). Proxy S2 strict detection remains low at 5–20% (Figure 22b), consistent with the geometric challenge of detecting instability along proxy-contamina…

**Figure 23.** S1 detection on proxy-contaminated representations, stratified by proxy condition. Columns: proxy conditions varying in strength (γ) and alignment (η). Same format as …

**Figure 24.** Proxy S2 detection stratified by proxy condition. Columns: proxy conditions. Detection rates remain low (5–35%) across all conditions, with stronger proxy (γ = 1.0) producing the highest detection rates.

**Figure 25.** Cross-modality resolution accuracy versus Neval. Probability that both modalities are correctly classified simultaneously. (a) S4h (H&E=S4, RNA=S3). (b) S4r (H&E=S3, RNA=S4). Factorized models achieve meaningful resolution; entangled models cannot resolve different scenarios across modalities.

**Figure 26.** … (Figure 26b/f), H&E is correctly classified as S4 by factorized models, but the RNA-specific signal is too weak for reliable localization, producing mostly S3 or indeterminate classifications for RNA. Mixed C (αs + αh with proxy γh > 0, η = 0.3, ground truth: RNA=S1, H&E=S2; Figure 26c/g) tests whether DECAT can distinguish shared biology from proxy-driven signal across modalities. Mixed D (αs + αb, ground truth: b…

**Figure 27.** Indeterminate sensitivity on transition configurations. (a) Transition A (biological ambiguity, no proxy or confounding). (b) Transition B (confounded ambiguity, with proxy and confounding overlaid). Factorized models increasingly return indeterminate with more data (correct behavior on non-identifiable tasks) while entangled models become more overconfident.

**Figure 28.** TCGA detection rates by variate index. Top row: strict accuracy (correct scenario assigned). Bottom row: conservative accuracy (correct or indeterminate). All continuous labels are binarized at the pan-cancer median. 95% Clopper–Pearson confidence intervals. (a/e) S1: CCA canonical variates (k=0–49), pooled random splits. Variate 0 carries the strongest shared signal (maximal cross-modal correlation) and …
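The Figure 28 caption reports 95% Clopper–Pearson intervals on detection rates. As a reminder of what that interval is, here is a dependency-free sketch that inverts the exact binomial tail probabilities by bisection. It is not the authors' implementation; in practice one would usually take the equivalent beta-distribution quantiles from a statistics library.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=100):
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k / n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# Toy example: 0 detections in 10 runs, and 5 detections in 10 runs.
zero_lo, zero_hi = clopper_pearson(0, 10)
half_lo, half_hi = clopper_pearson(5, 10)
```

The interval is conservative by construction, which matters when detection rates are estimated from a small number of splits: a rate of 0 out of 10 still leaves an upper bound of roughly 0.31.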

**Figure 29.** False shared claim rate (FSCR) on TCGA pooled random splits. (a) FSCR by variate index for S4r tasks (RNA-modality residual PCs). (b) FSCR decomposed by ground-truth scenario (S2B, S3, S4r). Entangled models (CCA, CLIP) systematically misclassify non-shared signal as shared biology. Factorized models (JIVE, DSSL at β≥1.0) eliminate false S4r claims.

**Figure 30.** Power curve for S2 detection (C drawn entirely from extreme pool E, α=1). S2 detection rate versus evaluation cohort size N under two designs: (a) vary-C (A′/B held at full size, only NC varied) and (b) equal-N (all cohorts varied together, A′=B=C=N). The gap between panels shows the value of large reference cohorts for probe quality. For example, under vary-C, TMB reaches 80% detection at NC ≈50, wherea…

Original abstract

Multimodal models in oncology can produce accurate predictions, but accurate prediction does not reveal whether the model has learned biology that is shared across modalities, biology confined to one modality, or spurious correlations that reflect confounders rather than genuine biology. We introduce DECAT, a model-agnostic post-hoc evaluation framework that classifies multimodal representations into four diagnostic scenarios for a given task and modality, using five null-referenced metrics and a rule-based decision procedure. The framework operates on learned representations, requires no knowledge of which specific confounder is present, and returns indeterminate when the evidence is insufficient. We validate DECAT on synthetic data across four multimodal model classes (over 2,500 trained representations) and on real data from 8,979 TCGA patients, evaluating both multimodal embeddings and five pretrained pathology foundation models. Entangled models (e.g., CLIP) achieve near-perfect shared biology detection but falsely claim shared biology in the majority of cases where it is absent on real foundation model embeddings. This false claim rate increases with confound strength so that larger cohorts and stronger representations produce more confident but still incorrect diagnoses. Applied to both multimodal TCGA embeddings and five pathology foundation models without paired RNA, DECAT detects confounding invisible to AUROC without requiring the confounder labels, as confirmed by post-hoc stratification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DECAT introduces a post-hoc rule-based diagnostic using five null-referenced metrics to flag confounding in multimodal oncology models, with solid synthetic tests but indirect evidence on real TCGA embeddings.

The main point is that this paper gives a practical framework, DECAT, that sorts multimodal representations into four diagnostic scenarios (shared biology, modality-specific biology, and confounder-driven spurious signal among them), or flags them as indeterminate, without needing confounder labels.

It is new in combining those five metrics with a rule-based procedure and a four-scenario taxonomy aimed at oncology multimodal work. The validation covers more than 2,500 synthetic representations across four model classes plus application to embeddings from 8,979 TCGA patients and five pathology foundation models.

The paper shows clearly that entangled models like CLIP pick up shared signals reliably in controlled synthetics but overclaim them on real foundation model embeddings, with the false claim rate rising as confound strength increases. The post-hoc stratification to reveal issues invisible to AUROC is a useful demonstration.

The soft spot is the real-data step. Synthetic cases have known structure, but the claim that the metrics and rules still separate scenarios correctly under complex unknown clinical confounding rests on an assumption that is not directly tested with ground-truth labels. The decision thresholds are free parameters that could shift results.

This is for researchers auditing multimodal models in pathology and oncology. It deserves a serious referee because the problem is concrete and the method is explicit, even though the real-data validation will need closer scrutiny on robustness.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces DECAT, a model-agnostic post-hoc framework that assigns multimodal representations to one of four diagnostic scenarios (including shared biology, modality-specific biology, and confounder-driven spurious signal), or returns indeterminate, using five null-referenced metrics and a rule-based decision procedure. It reports validation across more than 2,500 synthetic representations from four multimodal model classes and application to embeddings from 8,979 TCGA patients plus five pathology foundation models. The central empirical claim is that entangled models (e.g., CLIP) achieve near-perfect shared-biology detection on synthetic data but exhibit high false-positive rates on real embeddings, with the false-claim rate increasing with confound strength.

Significance. If the five null-referenced metrics and rule-based procedure can be shown to reliably separate the four scenarios on real patient embeddings whose confounding structure is unknown and more complex than the controlled synthetic cases, the framework would provide a useful post-hoc diagnostic for distinguishing biologically supported multimodal predictions from spurious ones in oncology, beyond standard metrics such as AUROC.

major comments (3)

[Abstract] Abstract and methods (as referenced in the reader's note): the central claim that DECAT reliably maps representations to the four scenarios on real TCGA embeddings rests on the assumption that the null-referenced metrics plus rule-based procedure generalize from synthetic data (known confounder structure) to real data (unknown, multi-variable clinical confounding). No direct accuracy measurement against ground-truth scenario labels is provided when the confounder is withheld, leaving the reported false-claim rates for entangled models on real foundation-model embeddings without independent confirmation.
[Abstract] Abstract: the statement that 'this false claim rate increases with confound strength' on real data requires a concrete operationalization of confound strength that does not rely on the same metrics used for classification; without it, the reported increase could be circular with the decision procedure itself.
[Abstract] Abstract: the framework is described as returning 'indeterminate when the evidence is insufficient,' yet the abstract supplies no quantitative thresholds or decision rules for the five metrics, making it impossible to assess whether the procedure is parameter-free or whether post-hoc choices affect the reported false-claim rates.

minor comments (1)

[Abstract] The abstract would benefit from a one-sentence definition or example of each of the four diagnostic scenarios to orient readers before the empirical claims.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for these constructive comments on the manuscript. We respond to each major comment below, indicating where revisions will be incorporated.

Referee: [Abstract] Abstract and methods (as referenced in the reader's note): the central claim that DECAT reliably maps representations to the four scenarios on real TCGA embeddings rests on the assumption that the null-referenced metrics plus rule-based procedure generalize from synthetic data (known confounder structure) to real data (unknown, multi-variable clinical confounding). No direct accuracy measurement against ground-truth scenario labels is provided when the confounder is withheld, leaving the reported false-claim rates for entangled models on real foundation-model embeddings without independent confirmation.

Authors: We agree that ground-truth scenario labels cannot be obtained for real TCGA embeddings, as the true multi-variable confounding structure is unknown by design. The synthetic experiments (with held-out confounder structure) serve to validate that the five metrics and rule-based procedure recover the correct scenario when the generative process is known. On real data the framework is applied diagnostically, with post-hoc stratification by clinical variables providing corroborating evidence. We will add an explicit limitations paragraph in the Discussion clarifying this point and the reliance on synthetic validation for procedural soundness. revision: yes
Referee: [Abstract] Abstract: the statement that 'this false claim rate increases with confound strength' on real data requires a concrete operationalization of confound strength that does not rely on the same metrics used for classification; without it, the reported increase could be circular with the decision procedure itself.

Authors: Confound strength on real data is operationalized via two external proxies that are independent of the five DECAT metrics: (1) cohort size (larger TCGA subsets) and (2) representation strength (model scale and pre-training data volume of the five pathology foundation models). The abstract already alludes to this via the clause on larger cohorts and stronger representations. We will add a dedicated paragraph in Methods defining these proxies and include a supplementary table showing the monotonic relationship between these proxies and the observed false-claim rate. revision: yes
Referee: [Abstract] Abstract: the framework is described as returning 'indeterminate when the evidence is insufficient,' yet the abstract supplies no quantitative thresholds or decision rules for the five metrics, making it impossible to assess whether the procedure is parameter-free or whether post-hoc choices affect the reported false-claim rates.

Authors: The abstract is a high-level summary; the quantitative thresholds, null-referenced metric definitions, and the complete rule-based decision tree (including the indeterminate condition) are fully specified in the Methods section. No post-hoc parameter tuning was performed; the rules were fixed prior to the real-data experiments. We will add a sentence in the abstract directing readers to the Methods for the decision procedure if space allows. revision: partial

standing simulated objections not resolved

Direct ground-truth scenario labels for real TCGA embeddings cannot be supplied because the true confounding structure is unknown and multi-variable; this is an inherent limitation of any diagnostic applied to observational clinical data.

Circularity Check

0 steps flagged

No significant circularity detected in DECAT framework or claims

The paper introduces DECAT as a new model-agnostic post-hoc framework that applies five null-referenced metrics and a rule-based decision procedure to classify representations into four diagnostic scenarios. The abstract and provided text describe validation on over 2,500 synthetic representations with controlled confounder structure plus application to real TCGA embeddings from 8,979 patients and five foundation models. No equations, decision rules, or claims are shown to reduce by construction to fitted inputs on the same data, self-definitions, or load-bearing self-citations. The central results (near-perfect detection on entangled models, false claims on real embeddings, detection of confounding invisible to AUROC) are presented as empirical outcomes of the independent framework rather than tautological renamings or forced predictions. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that null models and the chosen metrics can isolate shared-biology signals without explicit confounder labels; the decision thresholds and metric definitions are not detailed in the abstract and may function as implicit free parameters.

free parameters (1)

decision thresholds
Rule-based classification requires thresholds on the five metrics whose values are not specified in the abstract and may have been selected or tuned during development.

axioms (1)

Domain assumption: null-referenced metrics can separate shared biology, modality-specific biology, and confounding without knowledge of the specific confounder identity.
The abstract states the framework 'requires no knowledge of which specific confounder is present' and returns 'indeterminate when the evidence is insufficient'.

pith-pipeline@v0.9.1-grok · 5767 in / 1510 out tokens · 26413 ms · 2026-06-28T22:48:24.118745+00:00 · methodology

Reference graph

Works this paper leans on

54 extracted references · 30 canonical work pages · 1 internal anchor

[1]

Chen, Ming Y

Richard J. Chen, Ming Y . Lu, Drew F. K. Williamson, Tiffany Y . Chen, Jana Lipkova, Zahra Noor, Muhammad Shaban, Maha Shady, Mane Williams, Bumjin Joo, and Faisal Mahmood. Pan-cancer integrative histology-genomic analysis via multimodal deep learning.Cancer Cell, 40(8):865–878.e6, August 2022. ISSN 1535-6108. doi: 10.1016/j.ccell.2022.07.004

work page doi:10.1016/j.ccell.2022.07.004 2022
[2]

Song, Tong Ding, Sophia J

Anurag Vaidya, Andrew Zhang, Guillaume Jaume, Andrew H. Song, Tong Ding, Sophia J. Wagner, Ming Y . Lu, Paul Doucet, Harry Robertson, Cristina Almagro-Perez, Richard J. Chen, Dina ElHarouni, Georges Ayoub, Connor Bossi, Keith L. Ligon, Georg Gerber, Long Phi Le, and Faisal Mahmood. Molecular-driven Foundation Model for Oncologic Pathology, January
[3]

EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment, December 2025

Juseung Yun, Sunwoo Yu, Sumin Ha, Jonghyun Kim, Janghyeon Lee, Jongseong Jang, and Soonyoung Lee. EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment, December 2025. arXiv:2512.14019

work page arXiv 2025
[4]

A multimodal knowledge-enhanced whole-slide pathology foundation model.Nature Communications, 16(1): 11406, December 2025

Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Cheng Jin, Shu Yang, Jinbang Li, Zhengyu Zhang, Chenglong Zhao, Huajun Zhou, Zhenhui Li, Huangjing Lin, Xin Wang, Jiguang Wang, Anjia Han, Ronald Cheong Kin Chan, Li Liang, Xiuming Zhang, and Hao Chen. A multimodal knowledge-enhanced whole-slide pathology foundation model.Nature Communications, 16(1): 11406,...

work page doi:10.1038/s41467-025-66220-x 2025
[5]

Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data

Tianyu Liu, Tinglin Huang, Tong Ding, Hao Wu, Peter Humphrey, Sudhir Perincheri, Kurt Schalper, Rex Ying, Hua Xu, James Zou, Faisal Mahmood, and Hongyu Zhao. Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data. Nature Biomedical Engineering, pages 1–18, February 2026. ISSN 2157-846X. doi: 10.1038/ s41551-025-01602-6

2026
[6]

Howard, James Dolezal, Sara Kochanny, Jefree Schulte, Heather Chen, Lara Heij, Dezheng Huo, Rita Nanda, Olufunmilayo I

Frederick M. Howard, James Dolezal, Sara Kochanny, Jefree Schulte, Heather Chen, Lara Heij, Dezheng Huo, Rita Nanda, Olufunmilayo I. Olopade, Jakob N. Kather, Nicole Cipriani, Robert L. Grossman, and Alexander T. Pearson. The impact of site-specific digital histology signatures on deep learning model accuracy and bias.Nature Communications, 12(1):4423, Ju...

work page doi:10.1038/s41467-021-24698-1 2021
[7]

Do Histopathologi- cal Foundation Models Eliminate Batch Effects? A Comparative Study

Jonah Kömen, Hannah Marienwald, Jonas Dippel, and Julius Hense. Do Histopathologi- cal Foundation Models Eliminate Batch Effects? A Comparative Study. InAIM-FM Work- shop, Advances in Neural Information Processing Systems (NeurIPS). arXiv, November 2024. arXiv:2411.05489

work page arXiv 2024
[8]

de Jong, Eric Marcus, and Jonas Teuwen

Edwin D. de Jong, Eric Marcus, and Jonas Teuwen. Current Pathology Foundation Models are unrobust to Medical Center Differences, February 2025. arXiv:2501.18055

work page arXiv 2025
[9]

de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, and Klaus-Robert Müller

Jonah Kömen, Edwin D. de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, and Klaus-Robert Müller. Towards Robust Foundation Models for Digital Pathology, July 2025. arXiv:2507.17845

work page arXiv 2025
[10]

Gustafsson, Kajsa Ledesma Eriksson, and Mattias Rantalainen

Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, and Mattias Rantalainen. Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models, January 2026. arXiv:2601.04163. 10

work page arXiv 2026
[11]

Confounding factors and biases abound when predicting molecular biomarkers from histological images.Nature Biomedical Engineering, pages 1–15, March 2026

Muhammad Dawood, Kim Branson, Sabine Tejpar, Nasir Rajpoot, and Fayyaz ul Amir Afsar Minhas. Confounding factors and biases abound when predicting molecular biomarkers from histological images.Nature Biomedical Engineering, pages 1–15, March 2026. ISSN 2157- 846X. doi: 10.1038/s41551-026-01616-8

work page doi:10.1038/s41551-026-01616-8 2026
[12] Haotian Cui, Alejandro Tejada-Lapuerta, Maria Brbić, Julio Saez-Rodriguez, Simona Cristea, Hani Goodarzi, Mohammad Lotfollahi, Fabian J. Theis, and Bo Wang. Towards multimodal foundation models in molecular cell biology. Nature, 640(8059):623–633, April 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-08710-y
[13] Rami S. Vanguri, Jia Luo, Andrew T. Aukerman, Jacklynn V. Egger, Christopher J. Fong, Natally Horvat, Andrew Pagano, Jose de Arimateia Batista Araujo-Filho, Luke Geneslaw, Hira Rizvi, Ramon Sosa, Kevin M. Boehm, Soo-Ryum Yang, Francis M. Bodd, Katia Ventura, Travis J. Hollmann, Michelle S. Ginsberg, Jianjiong Gao, Rami Vanguri, Matthew D. Hellmann, Jenni... 2022
[14] Oz Kilim, Orsolya Pipek, Zsofia Sztupinszki, Miklos Diossy, Aurel Prosz, Cristina Naceur-Lombardelli, Selvaraju Veeriah, David Moore, Mariam Jamal-Hanjani, Allan Hackshaw, Janos Fillinger, Judit Moldvay, Istvan Csabai, Charles Swanton, and Zoltan Szallasi. A multimodal AI biomarker PATH-ORACLE improves prediction of recurrence in stage I lung adenocarcinoma, February 2026. doi: 10.64898/2026.01.28.26344973
[15] Frederick M. Howard, Jakob Nikolas Kather, and Alexander T. Pearson. Multimodal deep learning: An improvement in prognostication or a reflection of batch effect? Cancer Cell, 41(1):5–6, January 2023. ISSN 1535-6108, 1878-3686. doi: 10.1016/j.ccell.2022.10.025
[16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML). arXiv, February
[17] Chenyu Wang, Sharut Gupta, Xinyi Zhang, Sana Tonekaboni, Stefanie Jegelka, Tommi Jaakkola, and Caroline Uhler. An Information Criterion for Controlled Disentanglement of Multimodal Data. In Proceedings of the International Conference on Learning Representations. arXiv, March 2025. arXiv:2410.23996
[18] Paul Pu Liang, Zihao Deng, Martin Ma, James Zou, Louis-Philippe Morency, and Ruslan Salakhutdinov. Factorized Contrastive Learning: Going Beyond Multi-view Redundancy. In Advances in Neural Information Processing Systems (NeurIPS). arXiv, October 2023. arXiv:2306.05268
[19] Chenfeng Guo and Dongrui Wu. Canonical Correlation Analysis (CCA) Based Multi-View Learning: An Overview, May 2021. arXiv:1907.01693
[20] Eric F. Lock, Katherine A. Hoadley, J. S. Marron, and Andrew B. Nobel. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1), March 2013. ISSN 1932-6157. doi: 10.1214/12-AOAS597
[21] Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F. K. Williamson, Harry Robertson, Bowen Chen, Cristina Almagro-Pérez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Christina S. Chen, Daisuke Komura, Akihiro Kawabe, Mieko Ochi, Shinya Sato, Tomoyuki... doi: 10.1038/s41591-025-03982-3, 2025
[22] Gustavo Arango-Argoty, Elly Kipkogei, Ross Stewart, Gerald J. Sun, Arijit Patra, Ioannis Kagiampakis, and Etai Jacob. Pretrained transformers applied to clinical studies improve predictions of treatment efficacy and associated biomarkers. Nature Communications, 16(1):2101, March 2025. ISSN 2041-1723. doi: 10.1038/s41467-025-57181-2
[23] Ming Y. Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, Anil V. Parwani, Andrew Zhang, and Faisal Mahmood. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, March 2024. ISSN 1546-170X. doi: 10.1038/s41591-024-02856-4
[24] Richard J. Chen, Tong Ding, Ming Y. Lu, Drew F. K. Williamson, Guillaume Jaume, Andrew H. Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, Mane Williams, Lukas Oldenburg, Luca L. Weishaupt, Judy J. Wang, Anurag Vaidya, Long Phi Le, Georg Gerber, Sharifa Sahai, Walt Williams, and Faisal Mahmood. Towards a general-purpose foundation model for ... doi: 10.1038/s41591-024-02857-3, 2024
[25] Charlie Saillard, Rodolphe Jenatton, Felipe Llinares-López, Zelda Mariet, David Cahané, Eric Durand, and Jean-Philippe Vert. H-optimus-0, 2024. https://github.com/bioptimus/releases/tree/main/models/h-optimus/v0
[26] Daniel Kaplan, Ratna Sagari Grandhi, Connor Lane, Benjamin Warner, Tanishq Mathew Abraham, and Paul S. Scotti. How to Train a State-of-the-Art Pathology Foundation Model with $1.6k, 2025. Sophont Blog, https://sophont.med/blog/openmidnight
[27] Mikhail Karasikov, Joost van Doorn, Nicolas Känzig, Melis Erdal Cesur, Hugo Mark Horlings, Robert Berke, Fei Tang, and Sebastian Otálora. Training state-of-the-art pathology foundation models with orders of magnitude less data, April 2025. arXiv:2504.05186
[28] Guillaume Jaume, Lukas Oldenburg, Anurag Vaidya, Richard J. Chen, Drew F. K. Williamson, Thomas Peeters, Andrew H. Song, and Faisal Mahmood. Transcriptomics-guided Slide Representation Learning in Computational Pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv, May 2024. arXiv:2405.11618
[29] Weiqing Chen, Pengzhi Zhang, Tu N. Tran, Yiwei Xiao, Shengyu Li, Vrutant V. Shah, Hao Cheng, Kristopher W. Brannan, Keith Youker, Li Lai, Longhou Fang, Yu Yang, Nhat-Tu Le, Jun-ichi Abe, Shu-Hsia Chen, Qin Ma, Ken Chen, Qianqian Song, John P. Cooke, and Guangyu Wang. A visual–omics foundation model to bridge histopathology with spatial transcriptomics. Na... 2025
[30] Ian Fischer. The Conditional Entropy Bottleneck. Entropy, 22(9):999, September 2020. ISSN 1099-4300. doi: 10.3390/e22090999
[31] Yijie Zhang, Yiyang Shen, and Weiran Wang. Disentanglement of Variations with Multimodal Generative Modeling, September 2025. arXiv:2509.23548
[32] Yu Gui, Cong Ma, and Zongming Ma. IndiSeek learns information-guided disentangled representations, December 2025. arXiv:2509.21584
[33] Qilong Wu, Yiyang Shao, Jun Wang, and Xiaobo Sun. Learning Optimal Multimodal Information Bottleneck Representations. In Proceedings of the 42nd International Conference on Machine Learning. arXiv, May 2025. arXiv:2505.19996
[34] Xinyi Zhang, G. V. Shivashankar, and Caroline Uhler. Partially shared multi-modal embedding learns holistic representation of cell state. Nature Computational Science, 6(3):285–300, March. ISSN 2662-8457. doi: 10.1038/s43588-025-00948-w
[36] Till Richter, Eric Zimmermann, James Hall, Fabian J. Theis, Srivatsan Raghavan, Peter S. Winter, Ava P. Amini, and Lorin Crawford. Beyond alignment: Synergistic integration is required for multimodal cell foundation models, March 2026. bioRxiv preprint, doi: 10.64898/2026.02.23.707420
[37] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In Proceedings of the 36th International Conference on Machine Learning (ICML). arXiv, June 2019. arXiv:1811.12359
[38] Caroline Uhler and Jiaqi Zhang. Causal Structure and Representation Learning with Biomedical Applications, November 2025. arXiv:2511.04790

Appendix A Additional Simulator Limitations

The measurement model is linear (Appendix D.2), while real biological assays involve highly nonlinear transformations. This likely advantages linear models (CCA, JIVE) be...
[39] Observations are linear mixtures of latents plus Gaussian noise. Real assays involve nonlinear mixing, which the linear simulator does not capture.
[40] Latents are independent Gaussians. Real biology has complex dependencies, so the imposed independence is an optimistic assumption for $z_s$/$b$ separation.
[41] Measurement matrices $A^*$, $B^*$ are fixed across cohorts. Real cohorts differ in scanner, protocol, and pre-processing, and we do not model such measurement-level shifts.
[42] Confounding is shared across modalities via a single latent $b$. Modality-specific artifacts uncorrelated with the shared batch axis are not modeled as a separate latent, though the single-modality proxy ($\gamma_h > 0$, $\gamma_r = 0$) captures a related mechanism. Such artifacts do not generate shared-looking signal and are therefore not the hardest failure mode for S1 mi...
[43] Outcome weight vectors $\bar{w}^*$ are fixed across patients and dense (every latent dimension contributes). Real biomarkers may be sparse, with phenotype depending on a low-dimensional mechanism embedded in a higher-dimensional latent space.
[44] Modalities are treated as symmetric (similar SNR and contamination). In practice, one modality may capture shared biology more reliably than the other.
[45] All outcome-relevant structure is encoded in $(z_s, z_h, z_r, b)$ with no unmodeled confounders. In real data, additional unmeasured factors may influence both the outcome and the observed features.
[46] All modalities observe the same patient's latent state with perfect spatial and temporal correspondence. In real data, different assays may sample different tissue regions or time points, introducing intratumor heterogeneity and sampling variation that the simulator does not model.

E Model Architectures and Training

We evaluate four model classes spanning...
[47] Joint subspace: compute the SVD of the cross-covariance $C = X_h^\top X_r/(n-1)$, with the top $K$ left and right singular vectors $W_h$, $W_r$ defining the joint projection directions. This step is identical to CCA, so JIVE's joint component matches CCA's shared representation exactly.
[48] Individual subspaces: project out the joint subspace from each modality ($X_h^{\mathrm{res}} = X_h - X_h W_h W_h^\top$, analogously for $r$), then apply PCA to each residual to obtain $K$ individual components per modality. For evaluation cohorts, the encoder returns the concatenation $[Z^{(h)}_{\mathrm{joint}} \mid Z^{(h)}_{\mathrm{indiv}}] \in \mathbb{R}^{n \times 2K}$. The first $K$ dimensions are joint (shared) and the remaining $K$ ...
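The two steps above (joint subspace from the cross-covariance SVD, then PCA on the residual) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_joint_individual` and `encode_h`, the dictionary layout, and the mean-centering step are hypothetical choices made here for the sketch.

```python
import numpy as np

def fit_joint_individual(X_h, X_r, K):
    """Fit joint directions (W_h, W_r) and individual PCA directions (V_h, V_r)."""
    # Assumption: modalities are mean-centred before forming the cross-covariance.
    mu_h, mu_r = X_h.mean(axis=0), X_r.mean(axis=0)
    Xh, Xr = X_h - mu_h, X_r - mu_r
    n = Xh.shape[0]

    # Joint subspace: top-K singular vectors of C = X_h^T X_r / (n - 1).
    C = Xh.T @ Xr / (n - 1)
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    W_h, W_r = U[:, :K], Vt[:K].T

    def individual(X, W):
        # Project out the joint subspace, then take the top-K principal directions.
        X_res = X - X @ W @ W.T
        _, _, Vt_res = np.linalg.svd(X_res, full_matrices=False)
        return Vt_res[:K].T

    return {"mu_h": mu_h, "mu_r": mu_r, "W_h": W_h, "W_r": W_r,
            "V_h": individual(Xh, W_h), "V_r": individual(Xr, W_r)}

def encode_h(model, X_h):
    """Return [Z_joint | Z_indiv] of shape (n, 2K) for modality h."""
    Xc = X_h - model["mu_h"]
    Z_joint = Xc @ model["W_h"]
    Z_indiv = (Xc - Z_joint @ model["W_h"].T) @ model["V_h"]
    return np.hstack([Z_joint, Z_indiv])
```

Because the residual lies in the orthogonal complement of the joint directions, the individual components are orthogonal to `W_h` whenever the residual has rank of at least `K`.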
1. Probe direction $\hat{w}$. From the logistic regression coefficients $W_{LR}$ of the selected probe and per-dimension standard deviations $\hat{\sigma}$ estimated on $A'$:
$$\hat{w} = \frac{W_{LR}/\hat{\sigma}}{\lVert W_{LR}/\hat{\sigma} \rVert_2}. \tag{23}$$
Selection follows Stage III: shared-dominant leads to using the probe on $z_c$; modality-dominant leads to using the probe on $z_{ms}$; indeterminate leads to using the probe on $z_c$ (conservative fallback); ...
2. Scores. Project representations onto $\hat{w}$ without per-cohort standardization: $s_i^{(X)} = \langle z^{(i,s)}_{\mathrm{sig}}, \hat{w} \rangle$ for $X \in \{A', B, C\}$. Each patient receives a scalar representing their position along the outcome-predictive direction.
3. $A'$-referenced quantiles.
$$q_i^{(X)} = F_{A'}\big(s_i^{(X)}\big) = \frac{\big|\{\, j \in A' : s_j^{(A')} \le s_i^{(X)} \,\}\big|}{|A'|}, \quad X \in \{B, C\}. \tag{24}$$
$A'$ provides a fixed reference sca...
4. Wasserstein-1 distance. $D^{\mathrm{task}(s)}_{\mathrm{quantile}} = W_1(Q_B, Q_C)$, where $Q_X = \{q_i^{(X)}\}_{i \in X}$. A large $W_1$ indicates that the cohort shift moves patients differentially along the probe direction, suggesting composition-dependent ordering.

Null calibration. A one-sided permutation null is estimated by shuffling B/C cohort labels (preserving sizes); the probe direction ˆ...
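The quantile-shift metric described above can be sketched end to end in NumPy. This is an illustrative sketch under stated assumptions rather than the paper's code: all function names are hypothetical, the probe direction is held fixed during permutation, and taking the 95th percentile of the permutation distribution as the "null upper" threshold is an assumption, since the source text is truncated before it specifies the level.

```python
import numpy as np

def probe_direction(w_lr, sigma):
    """Eq. (23): scale coefficients by per-dimension std, then L2-normalise."""
    v = np.asarray(w_lr, dtype=float) / np.asarray(sigma, dtype=float)
    return v / np.linalg.norm(v)

def reference_quantiles(s_ref, s):
    """Eq. (24): fraction of reference (A') scores that are <= each score in s."""
    s_ref = np.sort(np.asarray(s_ref, dtype=float))
    return np.searchsorted(s_ref, s, side="right") / s_ref.size

def wasserstein1(a, b):
    """W1 between two 1-D empirical distributions: area between their CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(a, grid[:-1], side="right") / a.size
    cdf_b = np.searchsorted(b, grid[:-1], side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * np.diff(grid)))

def quantile_shift(Z_a, Z_b, Z_c, w_hat, n_perm=1000, level=0.95, seed=0):
    """Return (observed W1, permutation-null upper threshold)."""
    # Scores: projections onto the fixed probe direction, no per-cohort scaling.
    s_a, s_b, s_c = Z_a @ w_hat, Z_b @ w_hat, Z_c @ w_hat
    q_b = reference_quantiles(s_a, s_b)
    q_c = reference_quantiles(s_a, s_c)
    observed = wasserstein1(q_b, q_c)

    # Permutation null: shuffle B/C membership while preserving cohort sizes.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([q_b, q_c])
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(pooled)
        null[k] = wasserstein1(perm[:q_b.size], perm[q_b.size:])
    # Assumption: "null upper" is taken as the `level` quantile of the null.
    return observed, float(np.quantile(null, level))
```

Since the quantiles are referenced to $A'$ only, shuffling the pooled quantile values is equivalent to shuffling the B/C labels and recomputing them.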
1. Scenario 2 (checked first). If $\mathrm{CI}_{\mathrm{lower}}(P^{(s)}_{\mathrm{transfer}}) > \mathrm{null}_{\mathrm{upper}}$ and $D^{\mathrm{task}(s)}_{\mathrm{quantile}} \ge \mathrm{null}_{\mathrm{upper}}$, predictive signal transfers functionally but biological ordering is unstable across cohorts. Assign S2.
2. Scenario 1. If not S2, and $A_{\mathrm{norm}} > \mathrm{null}_{\mathrm{upper}}$, and signal-gate $p < 0.05$, and (for factorized models) shared-dominant localization, and $\mathrm{CI}_{\mathrm{lower}}(P^{(s)}_{\mathrm{transfer}}) >$ n...
3. Scenario 4. If not S2 or S1, and signal-gate $p < 0.05$, and (for factorized models) modality-dominant localization, and $\mathrm{CI}_{\mathrm{lower}}(P^{(s)}_{\mathrm{transfer}}) > \mathrm{null}_{\mathrm{upper}}$, and $D^{\mathrm{task}(s)}_{\mathrm{quantile}} < \mathrm{null}_{\mathrm{upper}}$: Assign S4.
4. Scenario 3. If signal-gate $p \ge 0.05$ and Scenario 2 not assigned: Assign S3.
5. Indeterminate. Any case not matching the above: Assign $\emptyset$.

G.5 Cross-Modality Scenario Intera...
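The five-step decision procedure listed above (Scenario 2 checked first, then Scenarios 1, 4, 3, and the indeterminate fallback) can be expressed as a small rule function. The sketch below is an assumption-laden illustration, not the authors' code: the `Metrics` container and its field names are hypothetical, each metric is given its own null threshold although the source writes a generic "null upper", and the truncated end of the Scenario 1 rule is read as "transfer CI lower bound exceeds its null upper bound", by analogy with the Scenario 4 rule.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Metrics:
    ci_lower_transfer: float       # CI_lower(P_transfer)
    transfer_null_upper: float     # null upper bound for transfer (assumed per-metric)
    d_quantile: float              # D_quantile (Wasserstein-1 on A'-referenced quantiles)
    d_quantile_null_upper: float   # null upper bound for D_quantile
    a_norm: float                  # A_norm
    a_norm_null_upper: float       # null upper bound for A_norm
    signal_gate_p: float           # signal-gate p-value
    localization: Optional[str] = None  # "shared", "modality", or None if not factorized

def assign_scenario(m: Metrics, alpha: float = 0.05) -> str:
    transfers = m.ci_lower_transfer > m.transfer_null_upper
    shifted = m.d_quantile >= m.d_quantile_null_upper
    signal = m.signal_gate_p < alpha

    # 1. Scenario 2 is checked first.
    if transfers and shifted:
        return "S2"
    # 2. Scenario 1 (tail of the rule is truncated in the source; see lead-in).
    if (m.a_norm > m.a_norm_null_upper and signal
            and m.localization in (None, "shared") and transfers):
        return "S1"
    # 3. Scenario 4: localization is only checked for factorized models.
    if signal and m.localization in (None, "modality") and transfers and not shifted:
        return "S4"
    # 4. Scenario 3: no significant signal (S2 was already ruled out above).
    if not signal:
        return "S3"
    # 5. Anything else is indeterminate.
    return "indeterminate"
```

Once Scenario 2 has been ruled out, a transferring probe necessarily has a quantile shift below its null bound, so the Scenario 1 branch does not need a separate check on `d_quantile`.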
[54] Outcome-scenario evaluation (linear probes on Cohorts $A'$, $B$, $C$ across multiple outcome configurations). This separation enables evaluating many biological scenarios on a single frozen representation without retraining, improving computational efficiency.

H.2 Representation-Generating Parameter Space

Each synthetic run is defined by a joint configuration o...

[1] Richard J. Chen, Ming Y. Lu, Drew F. K. Williamson, Tiffany Y. Chen, Jana Lipkova, Zahra Noor, Muhammad Shaban, Maha Shady, Mane Williams, Bumjin Joo, and Faisal Mahmood. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell, 40(8):865–878.e6, August 2022. ISSN 1535-6108. doi: 10.1016/j.ccell.2022.07.004

[2] Anurag Vaidya, Andrew Zhang, Guillaume Jaume, Andrew H. Song, Tong Ding, Sophia J. Wagner, Ming Y. Lu, Paul Doucet, Harry Robertson, Cristina Almagro-Perez, Richard J. Chen, Dina ElHarouni, Georges Ayoub, Connor Bossi, Keith L. Ligon, Georg Gerber, Long Phi Le, and Faisal Mahmood. Molecular-driven Foundation Model for Oncologic Pathology, January

[3] Juseung Yun, Sunwoo Yu, Sumin Ha, Jonghyun Kim, Janghyeon Lee, Jongseong Jang, and Soonyoung Lee. EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment, December 2025. arXiv:2512.14019

[4] Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Cheng Jin, Shu Yang, Jinbang Li, Zhengyu Zhang, Chenglong Zhao, Huajun Zhou, Zhenhui Li, Huangjing Lin, Xin Wang, Jiguang Wang, Anjia Han, Ronald Cheong Kin Chan, Li Liang, Xiuming Zhang, and Hao Chen. A multimodal knowledge-enhanced whole-slide pathology foundation model. Nature Communications, 16(1):11406, December 2025. doi: 10.1038/s41467-025-66220-x

[5] Tianyu Liu, Tinglin Huang, Tong Ding, Hao Wu, Peter Humphrey, Sudhir Perincheri, Kurt Schalper, Rex Ying, Hua Xu, James Zou, Faisal Mahmood, and Hongyu Zhao. Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data. Nature Biomedical Engineering, pages 1–18, February 2026. ISSN 2157-846X. doi: 10.1038/s41551-025-01602-6

[6] Frederick M. Howard, James Dolezal, Sara Kochanny, Jefree Schulte, Heather Chen, Lara Heij, Dezheng Huo, Rita Nanda, Olufunmilayo I. Olopade, Jakob N. Kather, Nicole Cipriani, Robert L. Grossman, and Alexander T. Pearson. The impact of site-specific digital histology signatures on deep learning model accuracy and bias. Nature Communications, 12(1):4423, Ju... doi: 10.1038/s41467-021-24698-1, 2021

[7] [7]

Do Histopathologi- cal Foundation Models Eliminate Batch Effects? A Comparative Study

Jonah Kömen, Hannah Marienwald, Jonas Dippel, and Julius Hense. Do Histopathologi- cal Foundation Models Eliminate Batch Effects? A Comparative Study. InAIM-FM Work- shop, Advances in Neural Information Processing Systems (NeurIPS). arXiv, November 2024. arXiv:2411.05489

work page arXiv 2024

[8] [8]

de Jong, Eric Marcus, and Jonas Teuwen

Edwin D. de Jong, Eric Marcus, and Jonas Teuwen. Current Pathology Foundation Models are unrobust to Medical Center Differences, February 2025. arXiv:2501.18055

work page arXiv 2025

[9] [9]

de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, and Klaus-Robert Müller

Jonah Kömen, Edwin D. de Jong, Julius Hense, Hannah Marienwald, Jonas Dippel, Philip Naumann, Eric Marcus, Lukas Ruff, Maximilian Alber, Jonas Teuwen, Frederick Klauschen, and Klaus-Robert Müller. Towards Robust Foundation Models for Digital Pathology, July 2025. arXiv:2507.17845

work page arXiv 2025

[10] [10]

Gustafsson, Kajsa Ledesma Eriksson, and Mattias Rantalainen

Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, and Mattias Rantalainen. Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models, January 2026. arXiv:2601.04163. 10

work page arXiv 2026

[11] [11]

Confounding factors and biases abound when predicting molecular biomarkers from histological images.Nature Biomedical Engineering, pages 1–15, March 2026

Muhammad Dawood, Kim Branson, Sabine Tejpar, Nasir Rajpoot, and Fayyaz ul Amir Afsar Minhas. Confounding factors and biases abound when predicting molecular biomarkers from histological images.Nature Biomedical Engineering, pages 1–15, March 2026. ISSN 2157- 846X. doi: 10.1038/s41551-026-01616-8

work page doi:10.1038/s41551-026-01616-8 2026

[12] [12]

Theis, and Bo Wang

Haotian Cui, Alejandro Tejada-Lapuerta, Maria Brbi´c, Julio Saez-Rodriguez, Simona Cristea, Hani Goodarzi, Mohammad Lotfollahi, Fabian J. Theis, and Bo Wang. Towards multimodal foundation models in molecular cell biology.Nature, 640(8059):623–633, April 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-08710-y

work page doi:10.1038/s41586-025-08710-y 2025

[13] [13]

Vanguri, Jia Luo, Andrew T

Rami S. Vanguri, Jia Luo, Andrew T. Aukerman, Jacklynn V . Egger, Christopher J. Fong, Natally Horvat, Andrew Pagano, Jose de Arimateia Batista Araujo-Filho, Luke Geneslaw, Hira Rizvi, Ramon Sosa, Kevin M. Boehm, Soo-Ryum Yang, Francis M. Bodd, Katia Ventura, Travis J. Hollmann, Michelle S. Ginsberg, Jianjiong Gao, Rami Vanguri, Matthew D. Hellmann, Jenni...

2022

[14] [14]

A multimodal AI biomarker PATH-ORACLE improves prediction of recurrence in stage I lung adenocarcinoma, February 2026

Oz Kilim, Orsolya Pipek, Zsofia Sztupinszki, Miklos Diossy, Aurel Prosz, Cristina Naceur- Lombardelli, Selvaraju Veeriah, David Moore, Mariam Jamal-Hanjani, Allan Hackshaw, Janos Fillinger, Judit Moldvay, Istvan Csabai, Charles Swanton, and Zoltan Szallasi. A multimodal AI biomarker PATH-ORACLE improves prediction of recurrence in stage I lung adenocarcin...

work page doi:10.64898/2026.01.28.26344973 2026

[15] [15]

Howard, Jakob Nikolas Kather, and Alexander T

Frederick M. Howard, Jakob Nikolas Kather, and Alexander T. Pearson. Multimodal deep learning: An improvement in prognostication or a reflection of batch effect?Cancer Cell, 41 (1):5–6, January 2023. ISSN 1535-6108, 1878-3686. doi: 10.1016/j.ccell.2022.10.025

work page doi:10.1016/j.ccell.2022.10.025 2023

[16] [16]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. InPro- ceedings of the 38th International Conference on Machine Learning (ICML). arXiv, February

[17] [17]

An Information Criterion for Controlled Disentanglement of Multimodal Data

Chenyu Wang, Sharut Gupta, Xinyi Zhang, Sana Tonekaboni, Stefanie Jegelka, Tommi Jaakkola, and Caroline Uhler. An Information Criterion for Controlled Disentanglement of Multimodal Data. InProceedings of the International Conference on Learning Representations. arXiv, March 2025. arXiv:2410.23996

work page arXiv 2025

[18] [18]

Factorized Contrastive Learning: Going Beyond Multi-view Redundancy

Paul Pu Liang, Zihao Deng, Martin Ma, James Zou, Louis-Philippe Morency, and Ruslan Salakhutdinov. Factorized Contrastive Learning: Going Beyond Multi-view Redundancy. InAdvances in Neural Information Processing Systems (NeurIPS). arXiv, October 2023. arXiv:2306.05268

work page arXiv 2023

[19] [19]

Canonical Correlation Analysis (CCA) Based Multi-View Learning: An Overview, May 2021

Chenfeng Guo and Dongrui Wu. Canonical Correlation Analysis (CCA) Based Multi-View Learning: An Overview, May 2021. arXiv:1907.01693

work page arXiv 2021

[20] [20]

Lock, Katherine A

Eric F. Lock, Katherine A. Hoadley, J. S. Marron, and Andrew B. Nobel. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types.The Annals of Applied Statistics, 7(1), March 2013. ISSN 1932-6157. doi: 10.1214/12-AOAS597

work page doi:10.1214/12-aoas597 2013

[21] [21]

Wagner, Andrew H

Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y . Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F. K. Williamson, Harry Robertson, Bowen Chen, Cristina Almagro-Pérez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Christina S. Chen, Daisuke Komura, Akihiro Kawabe, Mieko Ochi, Shinya Sato, Tomoyuki...

work page doi:10.1038/s41591-025-03982-3 2025

[22] [22]

Sun, Arijit Patra, Ioannis Kagiampakis, and Etai Jacob

Gustavo Arango-Argoty, Elly Kipkogei, Ross Stewart, Gerald J. Sun, Arijit Patra, Ioannis Kagiampakis, and Etai Jacob. Pretrained transformers applied to clinical studies improve predictions of treatment efficacy and associated biomarkers.Nature Communications, 16(1): 2101, March 2025. ISSN 2041-1723. doi: 10.1038/s41467-025-57181-2

work page doi:10.1038/s41467-025-57181-2 2025

[23] [23]

Lu, Bowen Chen, Drew F

Ming Y . Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, Anil V . Parwani, Andrew Zhang, and Faisal Mahmood. A visual-language foundation model for computational pathology.Nature Medicine, 30(3):863–874, March 2024. ISSN 1546-170X. doi: 10.1038/s41591-024-02856-4

work page doi:10.1038/s41591-024-02856-4 2024

[24] [24]

Chen, Tong Ding, Ming Y

Richard J. Chen, Tong Ding, Ming Y . Lu, Drew F. K. Williamson, Guillaume Jaume, Andrew H. Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, Mane Williams, Lukas Oldenburg, Luca L. Weishaupt, Judy J. Wang, Anurag Vaidya, Long Phi Le, Georg Gerber, Sharifa Sahai, Walt Williams, and Faisal Mahmood. Towards a general-purpose foundation model for ...

work page doi:10.1038/s41591-024-02857-3 2024

[25] [25]

H-optimus-0, 2024

Charlie Saillard, Rodolphe Jenatton, Felipe Llinares-López, Zelda Mariet, David Cahané, Eric Durand, and Jean-Philippe Vert. H-optimus-0, 2024. https://github.com/bioptimus/ releases/tree/main/models/h-optimus/v0

2024

[26] [26]

Daniel Kaplan, Ratna Sagari Grandhi, Connor Lane, Benjamin Warner, Tanishq Mathew Abraham, and Paul S. Scotti. How to Train a State-of-the-Art Pathology Foundation Model with $1.6k, 2025. Sophont Blog,https://sophont.med/blog/openmidnight

2025

[27] [27]

Training state-of-the-art pathology foundation models with orders of magnitude less data, April 2025

Mikhail Karasikov, Joost van Doorn, Nicolas Känzig, Melis Erdal Cesur, Hugo Mark Horlings, Robert Berke, Fei Tang, and Sebastian Otálora. Training state-of-the-art pathology foundation models with orders of magnitude less data, April 2025. arXiv:2504.05186

work page arXiv 2025

[28] [28]

Chen, Drew F

Guillaume Jaume, Lukas Oldenburg, Anurag Vaidya, Richard J. Chen, Drew F. K. Williamson, Thomas Peeters, Andrew H. Song, and Faisal Mahmood. Transcriptomics-guided Slide Repre- sentation Learning in Computational Pathology. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv, May 2024. arXiv:2405.11618

work page arXiv 2024

[29] [29]

Tran, Yiwei Xiao, Shengyu Li, Vrutant V

Weiqing Chen, Pengzhi Zhang, Tu N. Tran, Yiwei Xiao, Shengyu Li, Vrutant V . Shah, Hao Cheng, Kristopher W. Brannan, Keith Youker, Li Lai, Longhou Fang, Yu Yang, Nhat-Tu Le, Jun-ichi Abe, Shu-Hsia Chen, Qin Ma, Ken Chen, Qianqian Song, John P. Cooke, and Guangyu Wang. A visual–omics foundation model to bridge histopathology with spatial transcriptomics.Na...

2025

[30] [30]

The Conditional Entropy Bottleneck.Entropy, 22(9):999, September 2020

Ian Fischer. The Conditional Entropy Bottleneck.Entropy, 22(9):999, September 2020. ISSN 1099-4300. doi: 10.3390/e22090999

work page doi:10.3390/e22090999 2020

[31] [31]

Disentanglement of Variations with Multimodal Generative Modeling, September 2025

Yijie Zhang, Yiyang Shen, and Weiran Wang. Disentanglement of Variations with Multimodal Generative Modeling, September 2025. arXiv:2509.23548

work page arXiv 2025

[32] [32]

IndiSeek learns information-guided disentangled representations, December 2025

Yu Gui, Cong Ma, and Zongming Ma. IndiSeek learns information-guided disentangled representations, December 2025. arXiv:2509.21584

work page arXiv 2025

[33] [33]

Learning Optimal Multimodal Infor- mation Bottleneck Representations

Qilong Wu, Yiyang Shao, Jun Wang, and Xiaobo Sun. Learning Optimal Multimodal Infor- mation Bottleneck Representations. InProceedings of the 42nd International Conference on Machine Learning. arXiv, May 2025. arXiv:2505.19996

work page arXiv 2025

[34] [34]

Xinyi Zhang, G. V . Shivashankar, and Caroline Uhler. Partially shared multi-modal embedding learns holistic representation of cell state.Nature Computational Science, 6(3):285–300, March

[35] [35]

doi: 10.1038/s43588-025-00948-w

ISSN 2662-8457. doi: 10.1038/s43588-025-00948-w

work page doi:10.1038/s43588-025-00948-w

[36] [36]

Theis, Srivatsan Raghavan, Pe- ter S

Till Richter, Eric Zimmermann, James Hall, Fabian J. Theis, Srivatsan Raghavan, Pe- ter S. Winter, Ava P. Amini, and Lorin Crawford. Beyond alignment: Synergistic inte- gration is required for multimodal cell foundation models, March 2026. bioRxiv preprint, doi:10.64898/2026.02.23.707420. 12

work page doi:10.64898/2026.02.23.707420 2026

[37] [37]

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. InProceedings of the 36th International Conference on Machine Learning (ICML). arXiv, June 2019. arXiv:1811.12359

work page internal anchor Pith review Pith/arXiv arXiv 2019

[38] [38]

Causal Structure and Representation Learning with Biomedical Applications, November 2025

Caroline Uhler and Jiaqi Zhang. Causal Structure and Representation Learning with Biomedical Applications, November 2025. arXiv:2511.04790. 13 Appendix A Additional Simulator Limitations The measurement model is linear (Appendix D.2), while real biological assays involve highly nonlinear transformations. This likely advantages linear models (CCA, JIVE) be...

work page arXiv 2025

[39] [39]

Real assays involve nonlin- ear mixing, which the linear simulator does not capture

Observations are linear mixtures of latents plus Gaussian noise. Real assays involve nonlin- ear mixing, which the linear simulator does not capture

[40] [40]

Real biology has complex dependencies and imposed independence is an optimistic assumption forz s/bseparation

Latents are independent Gaussians. Real biology has complex dependencies and imposed independence is an optimistic assumption forz s/bseparation

[41] [41]

Real cohorts differ in scanner, protocol, and pre-processing and we do not model measurement-level shifts

Measurement matrices A∗, B∗ are fixed across cohorts. Real cohorts differ in scanner, protocol, and pre-processing and we do not model measurement-level shifts

[42] [42]

Confounding is shared across modalities via a single latent b. Modality-specific artifacts uncorrelated with the shared batch axis are not modeled as a separate latent, though single- modality proxy ( γh >0, γ r = 0 ) captures a related mechanism. Such artifacts do not generate shared-looking signal and are therefore not the hardest failure mode for S1 mi...

[43] [43]

Real biomarkers may be sparse, with phenotype depending on a low-dimensional mechanism embedded in a higher-dimensional latent space

Outcome weight vectors ¯w∗ are fixed across patients and dense (every latent dimension con- tributes). Real biomarkers may be sparse, with phenotype depending on a low-dimensional mechanism embedded in a higher-dimensional latent space

[44] [44]

In practice, one modality may capture shared biology more reliably than the other

Modalities are treated as symmetric (similar SNR and contamination). In practice, one modality may capture shared biology more reliably than the other

[45] [45]

In real data, additional unmeasured factors may influence both the outcome and the observed features

All outcome-relevant structure is encoded in (zs, zh, zr, b) with no unmodeled confounders. In real data, additional unmeasured factors may influence both the outcome and the observed features

[46] [46]

In real data, different assays may sample different tissue regions or time points, introducing intratumor heterogeneity and sampling variation that the simulator does not model

All modalities observe the same patient’s latent state with perfect spatial and temporal correspondence. In real data, different assays may sample different tissue regions or time points, introducing intratumor heterogeneity and sampling variation that the simulator does not model. E Model Architectures and Training We evaluate four model classes spanning...

[47] [47]

Joint subspace: compute the SVD of the cross-covariance $C = X_h^\top X_r/(n-1)$, with the top $K$ left and right singular vectors $W_h$, $W_r$ defining the joint projection directions. This step is identical to CCA, so JIVE’s joint component matches CCA’s shared representation exactly.

[48] [48]

Individual subspaces: project out the joint subspace from each modality ($X_h^{\text{res}} = X_h - X_h W_h W_h^\top$, analogously for $r$), then apply PCA to each residual to obtain $K$ individual components per modality. For evaluation cohorts, the encoder returns the concatenation $[Z^{(h)}_{\text{joint}} \mid Z^{(h)}_{\text{indiv}}] \in \mathbb{R}^{n \times 2K}$. The first $K$ dimensions are joint (shared) and the remaining $K$ ...
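The two steps above (joint directions from the SVD of the cross-covariance, then PCA on each residual) can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function names `fit_joint_individual` and `encode` are hypothetical, and the sketch assumes that features are mean-centered with training-cohort means and that evaluation cohorts reuse those means and directions.

```python
import numpy as np


def fit_joint_individual(X_h, X_r, K):
    """Fit K joint directions per modality and K individual directions per modality."""
    # Assumption: center each modality with its training mean before forming C.
    mu_h, mu_r = X_h.mean(axis=0), X_r.mean(axis=0)
    Xh, Xr = X_h - mu_h, X_r - mu_r
    n = Xh.shape[0]

    # Joint subspace: SVD of the cross-covariance C = Xh^T Xr / (n - 1).
    C = Xh.T @ Xr / (n - 1)
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    W_h, W_r = U[:, :K], Vt[:K].T

    def residual_pca(Xc, W):
        # Project out the joint subspace, then take the top-K principal directions.
        X_res = Xc - Xc @ W @ W.T
        _, _, Vt_res = np.linalg.svd(X_res, full_matrices=False)
        return Vt_res[:K].T

    return {
        "h": (mu_h, W_h, residual_pca(Xh, W_h)),
        "r": (mu_r, W_r, residual_pca(Xr, W_r)),
    }


def encode(X, params):
    """Return [Z_joint | Z_indiv] with shape (n, 2K) for one modality."""
    mu, W, P = params
    Xc = X - mu
    Z_joint = Xc @ W
    Z_indiv = (Xc - Xc @ W @ W.T) @ P
    return np.hstack([Z_joint, Z_indiv])
```

Calling `encode(X_h_eval, model["h"])` on an evaluation cohort would return an array whose first `K` columns are joint and whose last `K` columns are individual, matching the layout described in the text.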

[49] [49]

Probe direction $\hat{w}$. From the logistic regression coefficients $W_{LR}$ of the selected probe and per-dimension standard deviations $\hat{\sigma}$ estimated on $A'$:

$$\hat{w} = \frac{W_{LR}/\hat{\sigma}}{\lVert W_{LR}/\hat{\sigma} \rVert_2}. \tag{23}$$

Selection follows Stage III: shared-dominant leads to using the probe on $z_c$; modality-dominant leads to using the probe on $z_{ms}$; indeterminate leads to using the probe on $z_c$ (conservative fallback); ...

[50] [50]

Scores. Project representations onto $\hat{w}$ without per-cohort standardization: $s_i^{(X)} = \langle z^{(i,s)}_{\text{sig}}, \hat{w} \rangle$ for $X \in \{A', B, C\}$. Each patient receives a scalar representing their position along the outcome-predictive direction.

3. $A'$-referenced quantiles.

$$q_i^{(X)} = F_{A'}\big(s_i^{(X)}\big) = \frac{\big|\{j \in A' : s_j^{(A')} \le s_i^{(X)}\}\big|}{|A'|}, \quad X \in \{B, C\}. \tag{24}$$

$A'$ provides a fixed reference sca...

[51] [51]

Wasserstein-1 distance. $D^{\text{task}(s)}_{\text{quantile}} = W_1(Q_B, Q_C)$, where $Q_X = \{q_i^{(X)}\}_{i \in X}$. A large $W_1$ indicates that the cohort shift moves patients differentially along the probe direction, suggesting composition-dependent ordering.

Null calibration. A one-sided permutation null is estimated by shuffling B/C cohort labels (preserving sizes); the probe direction ˆ...
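The quantile-shift metric described above (Eq. 23 for the probe direction, Eq. 24 for the $A'$-referenced quantiles, then $W_1$ with a permutation null) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: all function names are hypothetical, the upper null bound is taken as the 95th percentile of the permutation distribution (the text above does not state the level), and the probe direction and quantiles are computed once, with only the B/C labels shuffled.

```python
import numpy as np


def probe_direction(w_lr, sigma):
    """Eq. (23): divide coefficients by per-dimension std, then normalize to unit L2 norm."""
    v = np.asarray(w_lr, dtype=float) / np.asarray(sigma, dtype=float)
    return v / np.linalg.norm(v)


def reference_quantiles(s_ref, s):
    """Eq. (24): fraction of reference (A') scores less than or equal to each score."""
    s_ref = np.sort(np.asarray(s_ref, dtype=float))
    return np.searchsorted(s_ref, s, side="right") / s_ref.size


def wasserstein_1(u, v):
    """W1 between two 1-D empirical distributions, as the area between their CDFs."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    grid = np.sort(np.concatenate([u, v]))
    cdf_u = np.searchsorted(u, grid[:-1], side="right") / u.size
    cdf_v = np.searchsorted(v, grid[:-1], side="right") / v.size
    return float(np.sum(np.abs(cdf_u - cdf_v) * np.diff(grid)))


def quantile_shift(Z_a, Z_b, Z_c, w_hat, n_perm=1000, level=0.95, seed=0):
    """Return (observed W1 between B and C quantiles, upper bound of the permutation null)."""
    # Scores are raw projections: no per-cohort standardization.
    s_a, s_b, s_c = Z_a @ w_hat, Z_b @ w_hat, Z_c @ w_hat
    q_b = reference_quantiles(s_a, s_b)
    q_c = reference_quantiles(s_a, s_c)
    observed = wasserstein_1(q_b, q_c)

    rng = np.random.default_rng(seed)
    pooled, n_b = np.concatenate([q_b, q_c]), q_b.size
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(pooled)  # shuffle B/C labels, preserving cohort sizes
        null[k] = wasserstein_1(perm[:n_b], perm[n_b:])
    # Assumption: `level` = 0.95 defines the one-sided upper null bound.
    return observed, float(np.quantile(null, level))
```

Referencing both cohorts to the same $A'$ distribution keeps the quantile scale fixed, so a shift between B and C shows up in $W_1$ rather than being absorbed by per-cohort rescaling.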

[52] [52]

1. Scenario 2 (checked first). If $\mathrm{CI}_{\text{lower}}(P^{(s)}_{\text{transfer}}) > \text{null}_{\text{upper}}$ and $D^{\text{task}(s)}_{\text{quantile}} \ge \text{null}_{\text{upper}}$, predictive signal transfers functionally but biological ordering is unstable across cohorts. Assign S2.
2. Scenario 1. If not S2, and $A_{\text{norm}} > \text{null}_{\text{upper}}$, and signal-gate $p < 0.05$, and (for factorized models) shared-dominant localization, and $\mathrm{CI}_{\text{lower}}(P^{(s)}_{\text{transfer}})$ >n...

[53] [53]

3. Scenario 4. If not S2 or S1, and signal-gate $p < 0.05$, and (for factorized models) modality-dominant localization, and $\mathrm{CI}_{\text{lower}}(P^{(s)}_{\text{transfer}}) > \text{null}_{\text{upper}}$, and $D^{\text{task}(s)}_{\text{quantile}} < \text{null}_{\text{upper}}$: Assign S4.
4. Scenario 3. If signal-gate $p \ge 0.05$ and Scenario 2 not assigned: Assign S3.
5. Indeterminate. Any case not matching the above: Assign $\emptyset$.

G.5 Cross-Modality Scenario Intera...
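The rule cascade listed before this heading (Scenario 2 first, then Scenarios 1, 4, 3, and the indeterminate fallback) can be sketched as a single function. This is a hedged illustration only: the `Diagnostics` container, its field names, and `assign_scenario` are hypothetical, each metric is assumed to carry its own null upper bound, and the Scenario 1 rule is truncated in the text, so the sketch encodes only its visible conditions and assumes its cut-off transfer threshold is the same null upper bound used in the other rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Diagnostics:
    transfer_ci_lower: float      # CI_lower of the transfer metric
    transfer_null_upper: float    # null upper bound for transfer
    d_quantile: float             # Wasserstein-1 quantile shift
    d_null_upper: float           # null upper bound for the shift
    a_norm: float                 # A_norm metric
    a_null_upper: float           # null upper bound for A_norm
    signal_gate_p: float          # signal-gate p-value
    # "shared", "modality", "indeterminate"; None for non-factorized models.
    localization: Optional[str] = None


def assign_scenario(d: Diagnostics, alpha: float = 0.05) -> str:
    transfers = d.transfer_ci_lower > d.transfer_null_upper
    ordering_shifts = d.d_quantile >= d.d_null_upper
    gate_open = d.signal_gate_p < alpha

    # Scenario 2 is checked first: signal transfers but ordering is unstable.
    if transfers and ordering_shifts:
        return "S2"

    # Scenario 1 (visible conditions only; the rule is truncated in the text).
    # Localization applies to factorized models only, so None skips the check.
    if (d.a_norm > d.a_null_upper and gate_open
            and d.localization in (None, "shared") and transfers):
        return "S1"

    # Scenario 4: modality-dominant signal that transfers with stable ordering.
    if (gate_open and d.localization in (None, "modality")
            and transfers and not ordering_shifts):
        return "S4"

    # Scenario 3: the signal gate fails and S2 was not assigned.
    if not gate_open:
        return "S3"

    return "indeterminate"
```

Because Scenario 2 is tested first, any case that reaches the Scenario 1 branch with a passing transfer check already has a quantile shift below its null bound, so the ordering condition does not need to be repeated there.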

[54] [54]

unimodal

Outcome-scenario evaluation (linear probes on Cohorts $A'$, B, C across multiple outcome configurations). This separation enables evaluating many biological scenarios on a single frozen representation without retraining, improving computational efficiency.

H.2 Representation-Generating Parameter Space

Each synthetic run is defined by a joint configuration o...

2000