arxiv: 2605.03281 · v1 · submitted 2026-05-05 · 🧬 q-bio.QM · cs.LG · stat.ML

Recognition: unknown

Donor-Aware scRNA-seq Benchmarks for IBD Classification

Jonathan Muhire

Pith reviewed 2026-05-09 16:43 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.LG · stat.ML

keywords scRNA-seq · IBD classification · donor-aware cross-validation · CLR composition · GatedStructuralCFN · compartment stratification · AUROC · Crohn's disease


The pith

Compartment-stratified CLR composition and GatedStructuralCFN embeddings classify IBD donors at AUROC 0.95-0.98 under strict donor-aware validation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that random cell splits in scRNA-seq data cause donor leakage and inflated performance, so donor-aware cross-validation is required for valid IBD classification benchmarks. On the SCP259 ulcerative colitis cohort, compartment-stratified centered log-ratio (CLR) cell-type composition reaches AUROC 0.956, while GatedStructuralCFN dependency embeddings on the same features reach 0.978. In the larger Kong Crohn's disease cohort, the same CFN approach peaks at 0.960 in the colon after feature filtering and exceeds linear CLR (0.900), though CLR-based baselines lead in the terminal ileum (CatBoost on CLR features at 0.967 versus 0.811 for CFN). Compartment stratification also removes the unit-sum-induced instability seen in dependency graphs built on global composition, yielding stable edge recurrence.
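
The leakage described above can be made concrete with a small sketch. It is not the paper's pipeline: the donor IDs, the cell counts, and the helper names (`donor_aware_folds`, `random_cell_folds`, `leaked_donors`) are hypothetical and chosen only for illustration. The sketch assigns cells to folds either through their donor or independently at random, then counts donors that end up on both sides of a split.

```python
import random

def donor_aware_folds(cell_donors, n_folds=5, seed=0):
    """Assign every cell to a fold via its donor, so no donor spans two folds."""
    donors = sorted(set(cell_donors))
    rng = random.Random(seed)
    rng.shuffle(donors)
    fold_of_donor = {d: i % n_folds for i, d in enumerate(donors)}
    return [fold_of_donor[d] for d in cell_donors]

def random_cell_folds(cell_donors, n_folds=5, seed=0):
    """Naive baseline: assign each cell to a fold independently of its donor."""
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in cell_donors]

def leaked_donors(cell_donors, folds, test_fold):
    """Donors that contribute cells to both the training and the test side."""
    train = {d for d, f in zip(cell_donors, folds) if f != test_fold}
    test = {d for d, f in zip(cell_donors, folds) if f == test_fold}
    return train & test

# Toy data (assumption): 30 hypothetical donors with 200 cells each.
cell_donors = [f"donor_{i:02d}" for i in range(30) for _ in range(200)]

aware = donor_aware_folds(cell_donors)
naive = random_cell_folds(cell_donors)

# Worst case over the five test folds.
n_leaked_aware = max(len(leaked_donors(cell_donors, aware, k)) for k in range(5))
n_leaked_naive = max(len(leaked_donors(cell_donors, naive, k)) for k in range(5))
print(n_leaked_aware, n_leaked_naive)
```

The donor-level assignment leaks no donors by construction, whereas the cell-level assignment places cells from the same donors on both sides of every split. The sketch does not stratify folds by disease label, which a real benchmark with few donors would probably need.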

Core claim

Compartment-stratified, CLR-transformed cell-type composition achieves AUROC 0.956 +/- 0.061 on SCP259, while GatedStructuralCFN on identical features reaches 0.978 +/- 0.050. In the Kong cohort, CFN peaks at 0.960 +/- 0.055 in the colon after filtering and exceeds linear CLR (0.900). Compartment-wise composition also eliminates the spurious unit-sum instability seen with global composition (Jaccard 0.026 for global composition versus top-20 recurrence of 1.0 for compartment-wise composition).
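
For readers unfamiliar with figures of the form "AUROC mean +/- spread", the following is a minimal sketch of how such a summary can be computed from donor-level test predictions. The per-fold labels and scores are invented, and whether the paper reports the sample or the population standard deviation is not stated in the abstract, so the use of `statistics.stdev` here is an assumption.

```python
from statistics import mean, stdev

def auroc(labels, scores):
    """AUROC as the probability that a random positive donor outranks a
    random negative donor, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative donor")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize(fold_aurocs):
    """Mean and sample standard deviation across folds (assumed convention)."""
    return mean(fold_aurocs), stdev(fold_aurocs)

# Hypothetical held-out donors per fold: (true labels, predicted scores).
folds = [
    ([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]),
    ([1, 0, 1, 0, 1, 0], [0.9, 0.6, 0.5, 0.2, 0.7, 0.4]),
    ([1, 1, 0, 0], [0.8, 0.6, 0.6, 0.1]),
]
fold_aurocs = [auroc(y, s) for y, s in folds]
avg, spread = summarize(fold_aurocs)
print(f"AUROC {avg:.3f} +/- {spread:.3f}")
```

With only a handful of donors in each test fold, a single misranked donor moves the fold AUROC by a large step, which is one reason the reported spreads are wide.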

What carries the argument

Compartment-stratified CLR cell-type composition vectors fed into GatedStructuralCFN dependency embeddings, which extract stable inter-cell-type relations within each anatomical compartment while enforcing donor separation during training and testing.
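
The feature construction named above can be sketched as follows. The centered log-ratio of a composition x with D parts is clr(x)_i = log x_i - (1/D) * sum_j log x_j. In the compartment-stratified variant, the transform is applied separately within each compartment rather than across all cell types at once. The compartment names, cell-type labels, and pseudocount of 0.5 below are assumptions for illustration, not values taken from the paper.

```python
import math
from collections import Counter

def clr(counts, pseudocount=0.5):
    """Centered log-ratio of a count vector; the pseudocount avoids log(0).
    Normalizing counts to proportions first would only shift every log by
    the same constant, which the centering removes."""
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]

def compartment_clr(cells, cell_types_by_compartment, pseudocount=0.5):
    """cells: (compartment, cell_type) pairs for ONE donor.
    Returns {(compartment, cell_type): value}, centering within each compartment."""
    counts = Counter(cells)
    features = {}
    for comp, types in cell_types_by_compartment.items():
        vec = [counts[(comp, t)] for t in types]
        for t, v in zip(types, clr(vec, pseudocount)):
            features[(comp, t)] = v
    return features

# Hypothetical compartments and cell types for a single donor.
compartments = {
    "epithelial": ["Enterocytes", "Goblet"],
    "immune": ["T cells", "B cells", "Macrophages"],
}
donor_cells = (
    [("epithelial", "Enterocytes")] * 120
    + [("epithelial", "Goblet")] * 30
    + [("immune", "T cells")] * 60
    + [("immune", "B cells")] * 40
)
features = compartment_clr(donor_cells, compartments)
for key, value in features.items():
    print(key, round(value, 3))
```

Because each compartment is centered on its own, a shift in the number of immune cells captured does not by itself change the epithelial features.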

If this is right

Compartment stratification is required to remove spurious correlations induced by the unit-sum constraint in cell composition features.
GatedStructuralCFN embeddings deliver a numerical edge over linear classifiers specifically in the colon region of Crohn's disease.
Cross-dataset transfer between Crohn's and ulcerative colitis cohorts reaches only modest AUC (0.833) when limited to four shared cell types.
Edge stability analysis shows compartment-wise features produce fully recurrent top dependencies whereas global composition does not.
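
The edge-stability comparison in the last point can be illustrated with a short sketch. The paper's exact definitions of "Jaccard" and "top-20 recurrence" are not given in the abstract, so the two metrics below (mean pairwise Jaccard of top-k edge sets, and the share of top-k edges that recur in every fold) are assumed readings, and the dependency matrices are invented.

```python
from itertools import combinations

def top_k_edges(matrix, names, k):
    """The k strongest off-diagonal edges by absolute weight, as (source, target) pairs."""
    edges = [((names[i], names[j]), abs(matrix[i][j]))
             for i in range(len(names)) for j in range(len(names)) if i != j]
    edges.sort(key=lambda e: (-e[1], e[0]))  # strongest first, ties broken by name
    return {edge for edge, _ in edges[:k]}

def mean_pairwise_jaccard(edge_sets):
    """Average Jaccard overlap over all pairs of folds."""
    pairs = list(combinations(edge_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

def recurrence(edge_sets):
    """Share of edges seen in any fold's top-k set that appear in every fold."""
    union = set().union(*edge_sets)
    common = set.intersection(*edge_sets)
    return len(common) / len(union)

names = ["A", "B", "C"]  # hypothetical cell types
fold1 = [[0, 0.9, 0.1], [0.2, 0, 0.8], [0.05, 0.3, 0]]
fold2 = [[0, 0.7, 0.2], [0.1, 0, 0.9], [0.3, 0.05, 0]]
fold3 = [[0, 0.1, 0.9], [0.8, 0, 0.2], [0.1, 0.3, 0]]

stable = [top_k_edges(m, names, 2) for m in (fold1, fold2)]
unstable = [top_k_edges(m, names, 2) for m in (fold1, fold3)]
print(mean_pairwise_jaccard(stable), recurrence(stable))
print(mean_pairwise_jaccard(unstable), recurrence(unstable))
```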

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regional performance gap between colon and ileum suggests that separate models per intestinal compartment may be needed for optimal clinical translation.
Verification of cell-type and compartment labels with orthogonal assays such as spatial transcriptomics would strengthen the benchmark reliability.
The same donor-aware, compartment-stratified workflow could be applied directly to other multi-region tissues or non-IBD inflammatory conditions.

Load-bearing premise

Cell-type annotations and compartment labels are accurate and consistent across the two cohorts, and the donor-aware cross-validation fully blocks any information leakage between training and test donors.

What would settle it

Re-run the full pipeline after deliberately swapping compartment labels on a subset of cells or after allowing donor overlap in the train-test splits and measure whether AUROC falls below 0.90.
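
The first perturbation in this test could be set up along the following lines. This is a hedged sketch: the function name, the swapped fraction, and the compartment labels are hypothetical, and the rest of the pipeline (feature construction, donor-aware cross-validation, AUROC) would have to be re-run on the perturbed labels.

```python
import random

def swap_compartment_labels(compartments, fraction, seed=0):
    """Return a copy in which a random `fraction` of cells receive a
    different compartment label, drawn uniformly from the other labels."""
    levels = sorted(set(compartments))
    if len(levels) < 2:
        raise ValueError("need at least two compartments to swap labels")
    rng = random.Random(seed)
    perturbed = list(compartments)
    n_swap = round(fraction * len(perturbed))
    for i in rng.sample(range(len(perturbed)), n_swap):
        perturbed[i] = rng.choice([c for c in levels if c != perturbed[i]])
    return perturbed

# Hypothetical per-cell compartment labels.
labels = ["epithelial"] * 50 + ["immune"] * 50
perturbed = swap_compartment_labels(labels, fraction=0.2)
n_changed = sum(a != b for a, b in zip(labels, perturbed))
print(n_changed, "of", len(labels), "labels changed")
```

Sweeping `fraction` upward and recording the resulting AUROC would show how much label noise the benchmark tolerates before it drops below the 0.90 mark mentioned above.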

Figures

Figures reproduced from arXiv: 2605.03281 by Jonathan Muhire.

**Figure 2.** GatedStructuralCFN dependency matrices averaged across cross…

**Figure 3.** Cross-dataset transfer AUROC restricted to the four cell types with…

Original abstract

Donor-level disease classification from single-cell RNA sequencing (scRNA-seq) requires strict donor-aware cross-validation: naive pipelines that split cells randomly conflate training and test donors, inflating reported performance through pseudoreplication. We present a donor-aware benchmark evaluating three feature representations across two independent IBD cohorts: centered log-ratio (CLR) transformed cell-type composition, GatedStructuralCFN dependency embeddings, and scVI variational autoencoder latent embeddings. The cohorts are the SCP259 ulcerative colitis atlas (UC vs. Healthy, n=30 donors, 51 cell types) and the Kong 2023 Crohn's disease atlas (CD vs. Healthy, n=71 donors, 55-68 cell types across three intestinal regions). Compartment-stratified CLR composition achieves AUROC 0.956 +/- 0.061 on SCP259; GatedStructuralCFN on the same features achieves 0.978 +/- 0.050. In the Kong cohort, CFN achieves its best performance in the colon region (0.960 +/- 0.055 after feature filtering), exceeding linear CLR (0.900 +/- 0.100), while terminal ileum classification is dominated by linear models (CatBoost CLR 0.967 +/- 0.075 vs. CFN 0.811 +/- 0.164). Cross-dataset transfer (CD->UC, four shared cell types) achieves AUC 0.833 with XGBoost CLR; the reverse direction performs at chance. CFN edge stability analysis shows that compartment-wise composition eliminates spurious unit-sum-induced instability present in global composition (Jaccard 0.026 vs. top-20 recurrence 1.0). CFN shows a consistent numerical advantage over linear models in the colon region of CD (AUROC 0.960 vs. 0.900), though no inter-method comparison reached statistical significance at n<=34 donors per region. Compartment-aware feature construction is critical for both classification performance and structural interpretability. Code: https://github.com/Jonathan-321/sfn-scrna-study

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This benchmark flags pseudoreplication in scRNA-seq IBD classifiers and gives concrete AUROC numbers on two cohorts, but small donor counts mean the reported edges over baselines are not significant.


This paper shows why random cell-level splits in scRNA-seq disease classification leak donor information and inflate performance, then runs donor-aware benchmarks on the SCP259 UC atlas and the Kong CD atlas using CLR composition, GatedStructuralCFN embeddings, and scVI latents. It reports compartment-stratified CLR at 0.956 AUROC on SCP259 and CFN at 0.978. In the Kong cohort, CFN pulls ahead in the colon, while CatBoost on CLR features does better in the terminal ileum. The paper also reports cross-dataset transfer results and a stability check on CFN edges under compartment stratification. The code is released, which helps.

Referee Report

3 major / 2 minor

Summary. The paper benchmarks donor-aware classification of IBD (UC vs healthy in SCP259; CD vs healthy in Kong) from scRNA-seq using three feature sets: compartment-stratified CLR cell-type compositions, GatedStructuralCFN dependency embeddings, and scVI latents. It stresses strict donor-level cross-validation to avoid pseudoreplication, reports AUROCs (e.g., 0.956 for CLR and 0.978 for CFN on SCP259; 0.960 for CFN in Kong colon after filtering), notes non-significant differences at small donor counts (n≤34), and highlights compartment-aware construction for performance and edge stability.

Significance. If the donor-aware splits and label harmonization hold, the work supplies a useful empirical reference for scRNA-seq disease classification, demonstrating that compartment stratification mitigates composition-induced instability and that structural embeddings can numerically outperform linear baselines in specific regions, while underscoring the limits of statistical power with current cohort sizes.

major comments (3)

[Methods] Methods (donor-aware CV implementation): the manuscript does not specify whether feature filtering, normalization, or any global preprocessing steps (explicitly mentioned for the Kong colon results) were performed inside or outside the donor-level folds; any global step would introduce leakage and undermine the central claim that the reported AUROCs (0.956–0.978) are unbiased.
[Results] Results (Kong cohort, colon region): the comparison of CFN (0.960 ± 0.055) vs linear CLR (0.900 ± 0.100) after 'feature filtering' lacks the exact filtering rule, threshold, or selection criterion; without this, the numerical advantage cannot be reproduced or interpreted as evidence of CFN superiority.
[Methods] Methods and cross-dataset transfer: cell-type harmonization between SCP259 (51 types) and Kong (55–68 types, three regions) is not detailed for the four shared types used in CD→UC transfer (AUC 0.833); any systematic annotation mismatch would affect both CLR and CFN equally and render the performance gap uninterpretable.

minor comments (2)

[Abstract] Abstract and Results: report the exact statistical test (e.g., paired Wilcoxon or DeLong) and p-values for all inter-method comparisons rather than only stating 'no statistical significance'.
[Methods] Figure legends or Methods: clarify how compartment labels were assigned and whether they were validated against independent annotations.
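
To illustrate the first minor comment, here is a minimal sketch of a paired test on per-fold AUROCs. It uses a sign-flip permutation test, which is a swapped-in stand-in for the paired Wilcoxon or DeLong tests the referee names, not either of those tests. The per-fold values are invented.

```python
import random
from statistics import mean

def paired_sign_flip_test(a, b, n_perm=10000, seed=0):
    """Monte Carlo two-sided p-value for mean(a - b) == 0, obtained by
    randomly flipping the sign of each paired difference."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-fold AUROCs for two methods evaluated on the same folds.
cfn_folds = [1.00, 0.95, 0.90, 1.00, 0.95]
clr_folds = [0.90, 0.90, 0.85, 0.95, 0.85]
p_value = paired_sign_flip_test(cfn_folds, clr_folds)
print(round(p_value, 3))
```

Even when one method wins every fold, as in this toy input, five paired folds allow only 2^5 = 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625. A result of this kind is consistent with the abstract's remark that no comparison reached significance at the available donor counts. Fold-level scores from cross-validation are also not independent, so any such test is approximate.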

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and have made revisions to the manuscript to incorporate the suggested clarifications.


Referee: [Methods] Methods (donor-aware CV implementation): the manuscript does not specify whether feature filtering, normalization, or any global preprocessing steps (explicitly mentioned for the Kong colon results) were performed inside or outside the donor-level folds; any global step would introduce leakage and undermine the central claim that the reported AUROCs (0.956–0.978) are unbiased.

Authors: We agree that explicit specification of the cross-validation procedure is essential to support the unbiased nature of the reported performance metrics. All feature filtering, normalization, and preprocessing steps were conducted strictly within each donor-level training fold, with no information from the test donors used at any stage. We have revised the Methods section to include a detailed description of the donor-aware CV pipeline, including confirmation that global steps were avoided, along with pseudocode illustrating the process.

revision: yes

Referee: [Results] Results (Kong cohort, colon region): the comparison of CFN (0.960 ± 0.055) vs linear CLR (0.900 ± 0.100) after 'feature filtering' lacks the exact filtering rule, threshold, or selection criterion; without this, the numerical advantage cannot be reproduced or interpreted as evidence of CFN superiority.

Authors: The referee is correct that the precise filtering criteria were not fully specified in the original submission. For the Kong colon results, feature filtering consisted of excluding cell types present in fewer than 10% of donors within the training fold (a threshold selected to ensure sufficient data for reliable composition estimation). This was applied independently per fold. We have updated the Results section with this exact rule and added a supplementary table showing the filtered cell types for transparency.

revision: yes

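
The per-fold filtering rule stated in this simulated response can be sketched as follows. The sketch takes the 10% threshold from the response above; whether the paper itself uses this rule is not confirmed by the abstract. The function names, donor IDs, and cell-type labels are hypothetical. The key point is that the kept cell types are chosen from training donors only and then applied unchanged to held-out donors.

```python
def fit_prevalence_filter(train_counts, min_fraction=0.10):
    """train_counts: {donor: {cell_type: count}} for TRAINING donors only.
    Keep cell types observed (count > 0) in at least `min_fraction` of them."""
    n_donors = len(train_counts)
    types = sorted({t for counts in train_counts.values() for t in counts})
    kept = []
    for t in types:
        present = sum(1 for counts in train_counts.values() if counts.get(t, 0) > 0)
        if present / n_donors >= min_fraction:
            kept.append(t)
    return kept

def apply_filter(counts, kept):
    """Project one donor onto the kept cell types; unseen types are ignored."""
    return [counts.get(t, 0) for t in kept]

# Hypothetical training fold: 20 donors, one common, one uncommon, one rare type.
train = {
    f"d{i}": {
        "T cells": 50,
        "B cells": 5 if i < 3 else 0,   # present in 3 of 20 donors (15%)
        "rare": 1 if i == 0 else 0,     # present in 1 of 20 donors (5%)
    }
    for i in range(20)
}
kept = fit_prevalence_filter(train)

# A held-out donor never influences which types are kept.
held_out = {"T cells": 40, "rare": 7, "unseen": 3}
vector = apply_filter(held_out, kept)
print(kept, vector)
```

Running the filter once on all donors before splitting would let test donors influence the feature set, which is the leakage the referee's first major comment warns about.
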
Referee: [Methods] Methods and cross-dataset transfer: cell-type harmonization between SCP259 (51 types) and Kong (55–68 types, three regions) is not detailed for the four shared types used in CD→UC transfer (AUC 0.833); any systematic annotation mismatch would affect both CLR and CFN equally and render the performance gap uninterpretable.

Authors: We acknowledge that the cell-type harmonization process for the cross-dataset transfer experiment was insufficiently described. The four shared types were identified by aligning cell-type labels based on shared marker genes and standard nomenclature from the original publications (specifically: T cells, B cells, Enterocytes, and Macrophages). We have added a dedicated subsection in the Methods detailing the harmonization criteria and a mapping table. While we agree that annotation inconsistencies could impact interpretability, the primary conclusions of the paper rely on within-cohort analyses, and the transfer results are presented as exploratory.

revision: yes
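
A mapping table of the kind this simulated response describes could be applied as in the sketch below. The four shared labels come from the response above. The cohort-specific labels in `label_map`, the pseudocount, and the function names are invented for illustration and do not reflect the actual annotations in either atlas.

```python
import math

SHARED = ["T cells", "B cells", "Enterocytes", "Macrophages"]

def harmonize(counts, label_map):
    """Sum cohort-specific labels into the shared labels.
    Labels that are absent from label_map are dropped."""
    shared_counts = {t: 0 for t in SHARED}
    for label, count in counts.items():
        target = label_map.get(label)
        if target is not None:
            shared_counts[target] += count
    return shared_counts

def clr_shared(shared_counts, pseudocount=0.5):
    """CLR over the shared types only, in the fixed order given by SHARED."""
    logs = [math.log(shared_counts[t] + pseudocount) for t in SHARED]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]

# Hypothetical fine-grained labels from one cohort mapped to the shared set.
label_map = {
    "CD4+ T": "T cells",
    "CD8+ T": "T cells",
    "Follicular B": "B cells",
    "Enterocyte": "Enterocytes",
    "Macrophage": "Macrophages",
}
donor = {"CD4+ T": 30, "CD8+ T": 20, "Follicular B": 10, "Enterocyte": 100, "Goblet": 25}
shared = harmonize(donor, label_map)
features = clr_shared(shared)
print(shared)
print([round(v, 3) for v in features])
```

Both cohorts would need to pass through the same mapping and the same feature order before a model trained on one is scored on the other.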

Circularity Check

0 steps flagged

No derivation chain present; purely empirical benchmark


The manuscript is an empirical benchmark study that reports AUROC values obtained via donor-aware cross-validation on held-out donor data from two IBD cohorts. No equations, first-principles derivations, predictions, or uniqueness theorems are advanced that could reduce to fitted parameters, self-citations, or ansatzes by construction. All reported metrics (e.g., 0.956 AUROC for compartment-stratified CLR, 0.978 for GatedStructuralCFN) are direct measurements on independent test splits; compartment-stratified feature construction and edge-stability analysis are likewise post-hoc empirical observations rather than deductive steps. The paper therefore contains no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical evaluation of standard and proposed feature representations under proper validation; no new free parameters or invented entities are introduced beyond the benchmark setup.

axioms (1)

Domain assumption: Donor-aware cross-validation is necessary to avoid pseudoreplication in scRNA-seq classification tasks. This is explicitly stated in the abstract as a requirement for valid performance reporting.

pith-pipeline@v0.9.0 · 5683 in / 1310 out tokens · 57588 ms · 2026-05-09T16:43:17.430481+00:00 · methodology

