Multimodal Cancer Modeling in the Age of Foundation Model Embeddings
Pith reviewed 2026-05-22 15:36 UTC · model grok-4.3
The pith
Zero-shot foundation model embeddings from multiple modalities can be fused to outperform unimodal models in cancer prediction tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that multimodal fusion of zero-shot foundation model embeddings is straightforward and provides additive benefits, outperforming unimodal approaches in cancer modeling. Including pathology report text further improves results, and the effects of text summarization and hallucination can be systematically evaluated. This establishes an embedding-centric paradigm for handling multimodal cancer data from sources like TCGA.
What carries the argument
Zero-shot foundation model embeddings from different data modalities fused using classical machine learning models for cancer prediction.
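As a rough illustration of what this kind of fusion can look like, the sketch below concatenates per-patient embedding vectors from two modalities and fits a simple classical classifier on the result. Everything here is an assumption made for illustration: the modality names, the toy vectors, and the nearest-centroid classifier are hypothetical and are not taken from the paper, which does not specify its models in the text above.

```python
def fuse(*modalities):
    """Concatenate patient-aligned embedding vectors, one list per modality."""
    n = len(modalities[0])
    assert all(len(m) == n for m in modalities), "modalities must be patient-aligned"
    return [sum((list(m[i]) for m in modalities), []) for i in range(n)]


def fit_centroids(X, y):
    """Nearest-centroid classifier: store one mean vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each row to the class whose centroid is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X]


# Hypothetical toy data: 4 patients, a 2-d "image" and a 1-d "text" embedding each.
image_emb = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
text_emb = [[0.0], [0.2], [1.0], [0.8]]
labels = [0, 0, 1, 1]

fused = fuse(image_emb, text_emb)      # each fused row has 2 + 1 = 3 features
model = fit_centroids(fused, labels)
preds = predict(model, fused)          # predictions on the training rows only
print(len(fused[0]), preds)
```

Because the embedding extractors are frozen, adding a modality in this setup only widens the feature vector; the downstream model is refit, not the foundation models.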
Load-bearing premise
The zero-shot embeddings from foundation models contain sufficient task-relevant information for cancer prediction without any fine-tuning or task-specific adaptation of the embedding models.
What would settle it
Demonstrating that multimodal fusion fails to improve upon the best unimodal model or that fine-tuning the foundation models is necessary to achieve competitive performance would falsify the central claim.
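One way such a check could be run is a paired bootstrap over patients, comparing the fused model against the best unimodal model on the same held-out set. The sketch below is a generic illustration under stated assumptions: accuracy as the metric, a 95% percentile interval, and the toy predictions are all hypothetical choices, not the paper's evaluation protocol.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Bootstrap the accuracy difference (a - b) by resampling patients with replacement.

    The same resampled patients are scored for both models, which keeps the
    comparison paired. Returns (observed_diff, lower, upper) for a 95%
    percentile interval.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(accuracy(yt, a) - accuracy(yt, b))
    diffs.sort()
    lower = diffs[n_boot * 25 // 1000]
    upper = diffs[n_boot * 975 // 1000 - 1]
    observed = accuracy(y_true, pred_a) - accuracy(y_true, pred_b)
    return observed, lower, upper


# Hypothetical held-out predictions: fused model 18/20 correct, unimodal 15/20.
y_true = [0, 1] * 10
fused_pred = list(y_true)
for i in range(2):
    fused_pred[i] = 1 - fused_pred[i]
unimodal_pred = list(y_true)
for i in range(5):
    unimodal_pred[i] = 1 - unimodal_pred[i]

observed, lower, upper = paired_bootstrap_diff(y_true, fused_pred, unimodal_pred)
print(f"accuracy gain: {observed:.3f} (95% interval {lower:.3f} to {upper:.3f})")
```

Under this framing, an interval that includes zero would mean the data do not support a multimodal gain at that sample size.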
Original abstract
The Cancer Genome Atlas (TCGA) has enabled novel discoveries and served as a large-scale reference dataset in cancer through its harmonized genomics, clinical, and imaging data. Numerous prior studies have developed bespoke deep learning models over TCGA for tasks such as cancer survival prediction. A modern paradigm in biomedical deep learning is the development of foundation models (FMs) to derive feature embeddings agnostic to a specific modeling task. Biomedical text especially has seen growing development of FMs. While TCGA contains free-text data as pathology reports, these have been historically underutilized. Here, we investigate the ability to train classical machine learning models over multimodal, zero-shot FM embeddings of cancer data. We demonstrate the ease and additive effect of multimodal fusion, outperforming unimodal models. Further, we show the benefit of including pathology report text and rigorously evaluate the effect of model-based text summarization and hallucination. Overall, we propose an embedding-centric approach to multimodal cancer modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an embedding-centric approach to multimodal cancer modeling on TCGA data. It extracts zero-shot embeddings from foundation models for genomics, imaging, and pathology report text, then trains classical machine learning models on unimodal and multimodal combinations. The central claims are that multimodal fusion is easy to implement and yields additive performance gains over unimodal baselines, that pathology report text provides measurable benefit, and that the effects of model-based summarization and hallucination on text embeddings can be rigorously quantified.
Significance. If the empirical results are robust, the work would illustrate a low-effort pathway for multimodal integration that leverages existing foundation models without task-specific fine-tuning of the embedding extractors. It would also highlight the value of historically underused free-text pathology reports and provide practical guidance on text preprocessing choices. These contributions could lower the barrier to multimodal cancer analysis if the zero-shot embeddings are shown to carry sufficient task-relevant signal.
major comments (2)
- [Abstract] The abstract asserts clear outperformance and additive effects of multimodal fusion but supplies no quantitative results, baselines, data splits, or statistical tests. Without these details it is impossible to determine whether the reported gains survive proper controls or are sensitive to post-hoc modeling choices.
- [Results / Methods] The central demonstration of additive multimodal benefit rests on the premise that off-the-shelf zero-shot embeddings already encode sufficient TCGA-specific clinical signals. An ablation comparing zero-shot embeddings against task-adapted or fine-tuned versions of the same foundation models would directly test whether the observed gains reflect complementary information or merely independent noise dimensions.
minor comments (2)
- The manuscript would benefit from an explicit statement of the exact foundation models (including versions and checkpoints) used for each modality and the precise classical ML algorithms and hyperparameter ranges applied to the concatenated embeddings.
- Figure captions and table legends should include the number of patients or samples per cohort and the cross-validation scheme to allow readers to assess statistical power.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the work.
- Referee: [Abstract] The abstract asserts clear outperformance and additive effects of multimodal fusion but supplies no quantitative results, baselines, data splits, or statistical tests. Without these details it is impossible to determine whether the reported gains survive proper controls or are sensitive to post-hoc modeling choices.
  Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the claims immediately. In the revised manuscript, we have updated the abstract to include key quantitative results (e.g., AUC improvements for multimodal over unimodal models), details on the train/test splits employed, and a mention of the statistical tests used to assess significance of the additive gains.
  Revision: yes
- Referee: [Results / Methods] The central demonstration of additive multimodal benefit rests on the premise that off-the-shelf zero-shot embeddings already encode sufficient TCGA-specific clinical signals. An ablation comparing zero-shot embeddings against task-adapted or fine-tuned versions of the same foundation models would directly test whether the observed gains reflect complementary information or merely independent noise dimensions.
  Authors: We appreciate this suggestion for probing the nature of the observed gains. Our study deliberately focuses on the zero-shot setting to demonstrate a low-effort, practical pathway that avoids task-specific fine-tuning of the foundation models. An ablation involving fine-tuning would address a related but distinct question and is beyond the current scope. We have added a paragraph in the Discussion section of the revised manuscript to explicitly acknowledge this point and identify such an ablation as a promising direction for future work.
  Revision: partial
Circularity Check
No significant circularity in empirical multimodal fusion study
The paper is an empirical study that extracts zero-shot embeddings from external foundation models (for genomics, imaging, and pathology text) and trains classical machine learning models on TCGA data for survival and subtype prediction. Multimodal fusion performance is evaluated via standard cross-validation and held-out metrics. There is no mathematical derivation chain, no parameter fitting that is later relabeled as prediction, and no load-bearing self-citation that reduces the reported gains to the inputs by construction. All central claims rest on observable performance differences against external benchmarks rather than internal redefinition or tautological fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: zero-shot foundation model embeddings capture sufficient information for downstream cancer survival or classification tasks across modalities.