pith. machine review for the scientific record.

arXiv: 2604.12075 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.AI · cs.LG · q-bio.QM

OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

keywords: tumor microenvironment, H&E histopathology, TCGA, open dataset, spatial analysis, cell classification, whole-slide imaging, AI pathology

The pith

OpenTME releases pre-computed tumor microenvironment profiles with over 4,500 quantitative readouts per slide from 3,634 TCGA H&E images across five cancers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The tumor microenvironment shapes cancer progression and response to therapy, yet consistent quantitative measurements from routine H&E slides have remained scarce at scale. This paper creates and releases OpenTME, an open dataset of AI-generated TME profiles drawn from thousands of whole-slide images in the TCGA collection for bladder, breast, colorectal, liver, and lung cancers. The profiles result from an automated pipeline that checks slide quality, segments tissue regions, detects and classifies individual cells, and computes spatial neighborhood statistics at cell-level resolution. The resulting collection exceeds 4,500 numeric features per slide and is hosted on Hugging Face for non-commercial academic use, with plans for ongoing additions. The release positions the dataset as a ready resource for biomarker work, spatial biology experiments, and new method development in computational pathology.

Core claim

We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from TCGA. All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution.

What carries the argument

Atlas H&E-TME, the AI application that runs tissue quality control, segmentation, cell classification, and spatial neighborhood measurements to produce the per-slide quantitative profiles.
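
As a rough illustration of what a spatial neighborhood readout can look like, the sketch below computes the average number of cells of one type found within a fixed radius of cells of another type. This is not the Atlas H&E-TME implementation, which is not described at this level of detail here. The function names, the radius values, the cell-type labels, and the coordinates are all assumptions made up for illustration.

```python
from collections import Counter


def neighborhood_counts(cells, radius):
    """For each cell, count neighbors of each type within `radius`.

    cells: list of (x, y, cell_type) tuples in any consistent unit.
    Returns one Counter per cell, in input order.
    Brute force, O(n^2); a real pipeline would likely use a spatial index.
    """
    r2 = radius * radius
    result = []
    for i, (xi, yi, _) in enumerate(cells):
        counts = Counter()
        for j, (xj, yj, tj) in enumerate(cells):
            if i == j:
                continue  # a cell is not its own neighbor
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= r2:
                counts[tj] += 1
        result.append(counts)
    return result


def mean_neighbors(cells, radius, center_type, neighbor_type):
    """Average count of `neighbor_type` cells within `radius` of each `center_type` cell."""
    per_cell = neighborhood_counts(cells, radius)
    values = [c[neighbor_type] for (_, _, t), c in zip(cells, per_cell) if t == center_type]
    return sum(values) / len(values) if values else float("nan")


# Hypothetical toy cells: (x, y, type). Not taken from OpenTME.
cells = [
    (0.0, 0.0, "tumor"),
    (3.0, 4.0, "lymphocyte"),   # distance 5 from the first tumor cell
    (10.0, 0.0, "tumor"),
    (10.0, 6.0, "lymphocyte"),  # distance 6 from the second tumor cell
    (50.0, 50.0, "fibroblast"),
]

print(mean_neighbors(cells, 5.0, "tumor", "lymphocyte"))  # 0.5
print(mean_neighbors(cells, 6.0, "tumor", "lymphocyte"))  # 1.0
```

The readout depends strongly on the chosen radius, which is one reason a fixed, documented set of definitions matters for reproducibility.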

If this is right

  • Researchers gain immediate access to standardized, large-scale TME data without running their own image-analysis pipelines.
  • The dataset supports direct comparisons of microenvironment features across five distinct cancer types from the same source archive.
  • Integration with existing TCGA genomic and clinical records becomes straightforward for multimodal studies.
  • Computational groups can use the profiles to prototype and benchmark new spatial-analysis algorithms.
  • Continued expansion will increase the number of slides and cancer types covered over time.
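
The TCGA integration point can be sketched as follows. TCGA slide identifiers begin with the patient barcode (the first three hyphen-separated fields, `TCGA-XX-XXXX`), so per-slide features can be joined to clinical records on that prefix. The feature names, clinical fields, barcodes, and values below are hypothetical placeholders and do not reflect the actual OpenTME schema.

```python
def patient_id(slide_id):
    """Return the TCGA patient barcode (TCGA-XX-XXXX) from a slide identifier."""
    parts = slide_id.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {slide_id!r}")
    return "-".join(parts[:3])


def join_features_with_clinical(slide_features, clinical):
    """Attach clinical fields to each slide's feature row by patient barcode.

    Slides without a matching clinical record are skipped.
    """
    merged = []
    for slide_id, feats in slide_features.items():
        pid = patient_id(slide_id)
        if pid in clinical:
            merged.append({"slide_id": slide_id, "patient_id": pid, **feats, **clinical[pid]})
    return merged


# Hypothetical per-slide readouts keyed by slide identifier (illustrative values only).
slide_features = {
    "TCGA-AA-0001-01Z-00-DX1": {"lymphocyte_density": 120.5},
    "TCGA-BB-0002-01Z-00-DX1": {"lymphocyte_density": 87.0},
}

# Hypothetical clinical table keyed by patient barcode.
clinical = {
    "TCGA-AA-0001": {"stage": "II"},
}

merged = join_features_with_clinical(slide_features, clinical)
print(merged)
```

A patient can have more than one slide, so a real analysis would also need to decide how to aggregate multiple slides per patient.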

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The collection could function as a public benchmark for evaluating future AI models on pathology images.
  • Aggregating the profiles might surface previously unseen cross-cancer patterns in cell neighborhoods or tissue architecture.
  • Downstream models trained on these features could be tested for their ability to predict treatment response or survival from H&E alone.
  • The fixed set of 4,500+ readouts per slide reduces variability when different labs attempt to reproduce TME findings.

Load-bearing premise

The AI pipeline produces accurate cell classifications, tissue segmentations, and spatial measurements without systematic errors or biases that would need separate confirmation by pathologists or clinical data.

What would settle it

A direct comparison of the AI cell-type labels and neighborhood statistics against pathologist annotations on a random subset of the same slides. Low agreement rates would show that the premise fails.
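
One common way to quantify such agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration under that assumption; the paper does not specify an agreement statistic, and the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal frequencies, summed over classes.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four cells (not real annotations).
ai_labels = ["tumor", "tumor", "lymphocyte", "lymphocyte"]
pathologist_labels = ["tumor", "tumor", "lymphocyte", "tumor"]

print(cohens_kappa(ai_labels, pathologist_labels))  # 0.5
```

Here raw agreement is 0.75, but chance agreement is 0.5, so kappa is 0.5. This is why raw agreement alone can overstate concordance when one cell type dominates.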

Figures

Figures reproduced from arXiv: 2604.12075 by Andrew Norgan, Ari Angelo, Blanca Pablos, Christina Embacher, Cornelius Böhm, Evelyn Ramberger, Frederick Klauschen, Gerrit Erdmann, Julika Ribbat-Idel, Kai Standvoss, Klaus-Robert Müller, Lukas Ruff, Maaike Galama, Maximilian Alber, Miriam Hägele, Nina Kozar-Gillan, Rosemarie Krupar, Simon Schallenberg, Todd Dembo, Verena Aumiller, Viktor Matyas.

Figure 1: Two TCGA examples from the OpenTME dataset with Atlas H&E-TME Tissue QC, Tissue Segmen…
Original abstract

The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five TCGA cancer types (bladder, breast, colorectal, liver, and lung). All profiles were generated by the Atlas H&E-TME AI application (built on the Atlas family of pathology foundation models), which performs tissue quality control, segmentation, cell detection/classification, and spatial neighborhood analysis to produce over 4,500 quantitative readouts per slide at cell-level resolution. The dataset is released on Hugging Face for non-commercial academic use, with plans for future expansion.

Significance. If the underlying AI pipeline is shown to be accurate, OpenTME would be a valuable large-scale resource for TME biomarker discovery, spatial biology, and computational pathology method development, as it supplies consistent, high-resolution quantitative features that are otherwise computationally expensive to generate from raw TCGA slides. The open release and multi-cancer scope add to its potential utility.

major comments (2)
  1. [Abstract] Abstract: The central claim that Atlas H&E-TME produces reliable quantitative TME readouts (tissue segmentation, cell classification, spatial neighborhoods) is unsupported because the manuscript reports no validation metrics whatsoever—no cell-classification F1 scores, no segmentation Dice coefficients versus expert annotations, no spatial feature agreement checks, and no correlation with genomic/clinical endpoints. This directly undermines the utility of every one of the >4,500 readouts per slide.
  2. [Methods (Atlas H&E-TME pipeline description)] Methods/description of Atlas H&E-TME: No information is given on training data, fine-tuning, or performance of the foundation models on the specific TCGA cohorts, nor any internal or external validation of the TME pipeline outputs. Without these, potential systematic biases in cell typing or neighborhood statistics for the five cancer types cannot be assessed and affect all downstream use of the dataset.
minor comments (2)
  1. The manuscript would benefit from a brief table summarizing the exact 4,500+ readout categories and their definitions to improve usability for potential users.
  2. Clarify the exact license and any usage restrictions beyond 'non-commercial academic research' on the Hugging Face page and in the text.
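
For readers unfamiliar with the validation metrics named in major comment 1, the sketch below shows how a Dice coefficient for a binary segmentation mask and a per-class F1 score for cell labels are typically computed. It is a generic illustration with invented inputs, not a reconstruction of any evaluation performed by the authors.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks (equal-shape nested lists)."""
    inter = pred_sum = true_sum = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            true_sum += 1 if t else 0
    total = pred_sum + true_sum
    return 1.0 if total == 0 else 2 * inter / total  # two empty masks agree fully


def f1_score(pred_labels, true_labels, positive):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    tp = fp = fn = 0
    for p, t in zip(pred_labels, true_labels):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical 2x3 masks: predicted tissue region vs. expert annotation.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
print(round(dice_coefficient(pred_mask, true_mask), 3))  # 0.667

# Hypothetical cell labels: predicted vs. expert.
pred_labels = ["tumor", "tumor", "stroma", "tumor"]
true_labels = ["tumor", "stroma", "stroma", "tumor"]
print(f1_score(pred_labels, true_labels, "tumor"))  # 0.8
```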

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important considerations for a data descriptor manuscript. OpenTME is released as a resource of pre-computed profiles generated by the Atlas H&E-TME pipeline; this paper does not constitute a primary validation study of the underlying models. We address each point below and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract: The central claim that Atlas H&E-TME produces reliable quantitative TME readouts (tissue segmentation, cell classification, spatial neighborhoods) is unsupported because the manuscript reports no validation metrics whatsoever—no cell-classification F1 scores, no segmentation Dice coefficients versus expert annotations, no spatial feature agreement checks, and no correlation with genomic/clinical endpoints. This directly undermines the utility of every one of the >4,500 readouts per slide.

    Authors: We agree that the abstract should avoid implying comprehensive validation within this manuscript. OpenTME is a data release paper, and the reliability claims rest on the prior characterization of the Atlas foundation models and the Atlas H&E-TME application in separate publications. In the revised version we will (1) tone down the abstract language to emphasize that the profiles are generated by a published pipeline and (2) add a concise Methods/Discussion paragraph with explicit citations to the relevant validation studies, including reported F1 scores for cell classification, Dice coefficients for segmentation, and any available spatial or clinical correlation results. This will allow readers to locate the supporting evidence without misrepresenting the current work as a validation study.

    revision: yes

  2. Referee: [Methods (Atlas H&E-TME pipeline description)] Methods/description of Atlas H&E-TME: No information is given on training data, fine-tuning, or performance of the foundation models on the specific TCGA cohorts, nor any internal or external validation of the TME pipeline outputs. Without these, potential systematic biases in cell typing or neighborhood statistics for the five cancer types cannot be assessed and affect all downstream use of the dataset.

    Authors: We accept this criticism. The Atlas models are general-purpose foundation models trained on large, multi-institutional pathology corpora and were applied to the TCGA slides without cohort-specific fine-tuning, which is an intentional design choice for broad applicability. In the revised manuscript we will expand the Methods section to (a) summarize the training data and reported performance metrics of the foundation models with citations, (b) state that no TCGA-specific fine-tuning or per-cohort validation was performed for this release, and (c) explicitly discuss the possibility of systematic biases and the value of community re-evaluation. We will also add a limitations paragraph addressing how users may assess or mitigate such biases when using the >4,500 readouts.

    revision: yes

Circularity Check

0 steps flagged

No derivation chain present; dataset release is self-contained

The paper introduces OpenTME as a direct release of pre-computed TME profiles from TCGA slides processed by the Atlas H&E-TME application. No mathematical derivations, equations, parameter fittings, predictions, or uniqueness theorems are claimed or walked through. The contribution consists solely of data generation and public availability on Hugging Face, with no load-bearing steps that reduce to self-citations or inputs by construction. This is a standard data-release paper with no circularity risk.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution is the curated dataset; it relies on pre-existing foundation models without introducing new fitted parameters or postulated entities.

axioms (1)
  • domain assumption: The Atlas family of pathology foundation models performs accurate tissue quality control, segmentation, cell detection and classification, and spatial neighborhood analysis on H&E slides.
    Invoked in the description of how the profiles were generated; no independent validation data is referenced in the abstract.

pith-pipeline@v0.9.0 · 5597 in / 1265 out tokens · 46227 ms · 2026-05-10T15:31:43.799509+00:00 · methodology
