Contrastive and Adaptive Multi-modal Masked Autoencoder for Spatial Transcriptomics
Pith reviewed 2026-06-26 14:31 UTC · model grok-4.3
The pith
A contrastive masked autoencoder uses adaptive genetic anchors selected from histology to predict whole-slide gene expression more accurately than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Contrastive and Adaptive Multi-modal Masked Autoencoder integrates visual histology features with sparse genetic anchors through a cross-modal joint encoder and contrastive alignment, yielding robust representations for accurate whole-slide gene expression imputation. The paper reports that the method outperforms existing approaches in both histology-only prediction and spatial imputation, whether no anchors are supplied or only 10 percent transcriptomic coverage is available.
What carries the argument
A cross-modal joint encoder inside a masked autoencoder fuses histology image patches with adaptively selected genetic anchors and aligns the two modalities through contrastive learning, which supports imputation of the remaining expression profile.
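As a rough illustration of the contrastive alignment step, the sketch below computes a symmetric InfoNCE loss between paired visual and genetic embeddings. This summary does not give the paper's exact loss, temperature, or embedding sizes, so the function name `info_nce`, the temperature of 0.07, and the toy vectors are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(visual, genetic, temperature=0.07):
    """Symmetric InfoNCE: visual[i] and genetic[i] form the positive pair.

    Hypothetical helper; the temperature value is an assumed default.
    """
    if not visual or len(visual) != len(genetic):
        raise ValueError("need the same non-zero number of embeddings per modality")
    v = [l2_normalize(x) for x in visual]
    g = [l2_normalize(x) for x in genetic]
    # Cosine similarity between every visual/genetic pair, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, gj)) / temperature for gj in g] for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-genetic and genetic-to-visual directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
print(aligned < swapped)
```

In this toy case the loss is small when each anchor's embedding points the same way as its matching visual embedding and large when the pairs are swapped, which is the behaviour a contrastive alignment term is meant to reward.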
If this is right
- The method produces higher accuracy than prior work when predicting from histology images alone.
- Performance further improves when 10 percent transcriptomic coverage is supplied as anchors.
- Selected anchors form contiguous regions that match constraints of existing ST profiling hardware.
- Contrastive alignment of visual and genetic features inside the joint encoder yields representations that support accurate imputation across the slide.
Where Pith is reading between the lines
- The adaptive anchor strategy might transfer to other biomedical tasks where one expensive modality can be sparsely sampled to guide prediction from a cheaper modality.
- If the bio-saliency ranking generalizes, it could reduce the number of spots that need to be profiled in clinical workflows without sacrificing imputation quality.
- Testing the same selection logic on datasets with different tissue architectures would show whether the contiguous-region constraint remains the main practical bottleneck.
Load-bearing premise
The bio-saliency score and learning-to-rank procedure will reliably select spots that are both informative for gene expression and form contiguous regions compatible with real-world spatial transcriptomics hardware.
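To make the premise concrete, here is a minimal sketch of one way a contiguous anchor region could be chosen from per-spot scores. It is not the paper's procedure: the learning-to-rank step is not modelled, the scores stand in for the bio-saliency score, and restricting the region to an axis-aligned rectangle of fixed size is an assumption made to keep the example short. The function name `best_window` is hypothetical.

```python
def best_window(scores, height, width):
    """Return (total, top_row, left_col) of the highest-scoring rectangle.

    `scores` is a 2-D list of per-spot scores on a regular grid. A summed-area
    table makes each window sum an O(1) lookup.
    """
    rows, cols = len(scores), len(scores[0])
    if not (0 < height <= rows and 0 < width <= cols):
        raise ValueError("window must fit inside the grid")

    # sat[r][c] holds the sum of scores[0:r][0:c]; row 0 and column 0 are zero.
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = scores[r][c] + sat[r][c + 1] + sat[r + 1][c] - sat[r][c]

    best = None
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            total = (sat[r + height][c + width] - sat[r][c + width]
                     - sat[r + height][c] + sat[r][c])
            if best is None or total > best[0]:  # ties keep the first window found
                best = (total, r, c)
    return best


# Toy 4 x 5 grid with one high-scoring 2 x 2 block (values are made up).
grid = [[0.0] * 5 for _ in range(4)]
grid[1][2], grid[1][3], grid[2][2], grid[2][3] = 1.0, 2.0, 3.0, 4.0
print(best_window(grid, 2, 2))
```

A fixed rectangle is the simplest shape that is trivially contiguous; real capture areas and tissue outlines may call for other shapes, which this sketch does not attempt to cover.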
What would settle it
An independent test set in which the method does not exceed the accuracy of current histology-only or standard imputation baselines at matched coverage levels would falsify the superiority claim.
Original abstract
The high cost of spatial transcriptomics (ST) has driven extensive studies into predicting gene expression directly from H&E histology images. However, this prediction task faces an inherent limitation, as tissue morphology alone provides insufficient information to fully resolve underlying gene expression. To address this limitation, a recent study leverages partial gene expression to guide the prediction process alongside histology images. Building on this paradigm, we approach the prediction task as a spatial imputation problem, employing a Masked Autoencoder (MAE) to utilize a small fraction of gene expression as genetic anchors for inferring whole-slide gene expression profiles. Specifically, we propose a bio-saliency score and a learning-to-rank strategy to adaptively identify the most informative spots within the tissue. Based on these identified spots, our framework selects contiguous regions as genetic anchors to ensure suitability for real-world ST profiling hardware. To effectively leverage these anchors, we design a cross-modal joint encoder that integrates visual and genetic modalities. By aligning the selected anchors with their corresponding visual features via contrastive learning, the encoder generates robust joint representations to accurately predict gene expression across the whole slide. Notably, our framework consistently surpasses existing methods in both histology-only prediction and spatial imputation, achieving superior accuracy even without genetic anchors and further excelling with as little as 10% transcriptomic coverage. Our code is available at https://github.com/Kyyle2114/CAMMST.
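The imputation framing in the abstract can be sketched as a masked reconstruction objective: anchor spots expose their measured expression to the model, and the error is scored on the remaining spots. The snippet below assumes a mean squared error computed only over non-anchor spots, in the spirit of the original MAE, which scores masked patches only. Whether the paper uses exactly this loss is not stated here, and `masked_mse` is a hypothetical name.

```python
def masked_mse(predicted, target, is_anchor):
    """Mean squared error over spots whose expression was hidden from the model.

    predicted, target: one list of gene values per spot.
    is_anchor: one boolean per spot; True means the expression was given as input.
    """
    total, count = 0.0, 0
    for pred_row, true_row, anchor in zip(predicted, target, is_anchor):
        if anchor:
            continue  # anchors are inputs, so they are excluded from the score
        for p, t in zip(pred_row, true_row):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("every spot is an anchor; nothing to impute")
    return total / count


# Three spots, two genes each; only the first spot is an anchor (toy values).
pred = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
true = [[1.0, 2.0], [0.0, 0.0], [5.0, 8.0]]
print(masked_mse(pred, true, [True, False, False]))
```

Setting every entry of `is_anchor` to False corresponds to the histology-only regime, where no expression is supplied and all spots are scored.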
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CAMMST, a contrastive and adaptive multi-modal masked autoencoder for spatial transcriptomics. It frames gene expression prediction from H&E images as a spatial imputation task, using a masked autoencoder with a small fraction of gene expression as genetic anchors. A bio-saliency score combined with learning-to-rank selects informative contiguous spots suitable for real-world ST hardware; these anchors are integrated via a cross-modal joint encoder aligned by contrastive learning. The framework claims consistent superiority over existing methods in both histology-only prediction and imputation, including with as little as 10% transcriptomic coverage. Code is released at the provided GitHub link.
Significance. If the reported accuracy gains hold under full scrutiny of the experiments, the work could meaningfully advance cost-effective ST profiling by enabling reliable imputation from minimal, hardware-compatible anchors and even improving morphology-only baselines. The explicit release of code supports reproducibility, and the adaptive anchor selection plus contrastive joint encoding represent a concrete technical contribution over prior MAE and multi-modal baselines.
major comments (2)
- [anchor selection procedure] § on anchor selection (bio-saliency + learning-to-rank): the claim that selected contiguous regions are reliably informative and hardware-compatible rests on the assumption that the ranking strategy generalizes across tissue types and ST platforms; this needs explicit ablation showing that performance degrades gracefully when the selection heuristic is replaced by random or uniform contiguous sampling of the same coverage percentage.
- [experimental results] Results tables (histology-only and 10%-anchor regimes): the reported consistent superiority must be accompanied by statistical significance tests (e.g., paired t-tests or Wilcoxon across multiple random seeds and multiple datasets) and variance estimates; without these, the cross-method comparison cannot be considered load-bearing for the central empirical claim.
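The matched-coverage control requested in the first major comment could be sketched as follows. It draws a random contiguous rectangle with the same number of spots as the adaptive selection would use. The rectangular shape, the grid dimensions, and the name `random_contiguous_window` are assumptions for illustration and are not taken from the paper.

```python
import random


def random_contiguous_window(rows, cols, height, width, seed=0):
    """Pick a height x width rectangle uniformly at random inside a rows x cols grid.

    Returns the (row, col) coordinates of the selected spots. A fixed seed makes
    the draw repeatable so the ablation can be rerun exactly.
    """
    if not (0 < height <= rows and 0 < width <= cols):
        raise ValueError("window must fit inside the grid")
    rng = random.Random(seed)
    top = rng.randint(0, rows - height)    # randint is inclusive on both ends
    left = rng.randint(0, cols - width)
    return [(r, c) for r in range(top, top + height) for c in range(left, left + width)]


# 10 x 10 grid, 2 x 5 window: 10 of 100 spots, i.e. 10 percent coverage.
spots = random_contiguous_window(10, 10, 2, 5, seed=1)
print(len(spots))
```

Repeating the draw over several seeds and comparing against the adaptive selection at the same coverage would isolate how much of the gain comes from where the anchors are placed rather than from how many are supplied.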
minor comments (3)
- [abstract and method overview] The abstract states 'even without genetic anchors' yet the method description centers on anchor usage; clarify whether the histology-only mode simply disables the genetic branch or uses a different training regime.
- [method] Notation for the contrastive loss and the joint encoder should be defined once in a dedicated subsection rather than inline, to improve readability for readers unfamiliar with multi-modal MAE variants.
- [figures] Figure captions for the architecture diagram should explicitly label the bio-saliency map, the selected anchor patches, and the contrastive alignment arrows.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. The comments highlight valuable ways to strengthen the empirical support for our claims. We address each point below and will update the manuscript accordingly.
- Referee: [anchor selection procedure] § on anchor selection (bio-saliency + learning-to-rank): the claim that selected contiguous regions are reliably informative and hardware-compatible rests on the assumption that the ranking strategy generalizes across tissue types and ST platforms; this needs explicit ablation showing that performance degrades gracefully when the selection heuristic is replaced by random or uniform contiguous sampling of the same coverage percentage.
Authors: We agree that an explicit ablation against random and uniform contiguous sampling at matched coverage is needed to substantiate the value of the bio-saliency + learning-to-rank procedure. In the revision we will add this comparison (10% coverage) on all evaluated datasets, reporting the performance drop when the adaptive heuristic is replaced by non-informative contiguous selection. Our current experiments already cover multiple tissue types and platforms, but the requested controlled ablation will be included to directly address generalization. revision: yes
- Referee: [experimental results] Results tables (histology-only and 10%-anchor regimes): the reported consistent superiority must be accompanied by statistical significance tests (e.g., paired t-tests or Wilcoxon across multiple random seeds and multiple datasets) and variance estimates; without these, the cross-method comparison cannot be considered load-bearing for the central empirical claim.
Authors: We accept that variance estimates and statistical tests are required for the central claims. We will rerun all experiments with at least five random seeds, report mean ± standard deviation in the tables, and add paired t-tests (or Wilcoxon signed-rank tests where appropriate) comparing CAMMST against each baseline across datasets and seeds. These results and p-values will be incorporated into the revised tables. revision: yes
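A seed-level comparison of the kind described in the responses above can be sketched with the standard library alone. The snippet reports mean and standard deviation per method and runs an exact paired sign-flip permutation test. That test is swapped in here as a dependency-free stand-in for the paired t-test or Wilcoxon signed-rank test named above; it is not the procedure the authors commit to. The scores are invented placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired permutation test on per-seed score differences.

    Under the null hypothesis each paired difference is equally likely to have
    either sign, so all 2**n sign assignments are enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if flipped >= observed - 1e-12:  # small tolerance for float round-off
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-seed scores for two methods (five seeds each).
method_a = [0.50, 0.52, 0.51, 0.53, 0.49]
method_b = [0.45, 0.46, 0.47, 0.48, 0.44]

print(f"A: {mean(method_a):.3f} +/- {stdev(method_a):.3f}")
print(f"B: {mean(method_b):.3f} +/- {stdev(method_b):.3f}")
print("p =", paired_sign_flip_test(method_a, method_b))
```

One design point worth noting: with only five pairs, an exact two-sided test of this kind cannot return a p-value below 2/32 = 0.0625, even when one method wins on every seed. Pooling pairs across datasets as well as seeds, or using more seeds, is needed before a 0.05 threshold is reachable.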
Circularity Check
No significant circularity
The paper applies standard MAE reconstruction and contrastive alignment losses to a multi-modal histology-plus-anchor setup, with anchor selection performed via an explicit bio-saliency + learning-to-rank procedure whose outputs are then fed into the encoder. No equation or claim reduces a reported prediction to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The central results are empirical accuracy numbers on held-out ST benchmarks, which remain falsifiable outside the training loop. The derivation chain is therefore self-contained against external data.