arXiv: 2604.02511 · v1 · submitted 2026-04-02 · cs.LG · q-bio.GN · q-bio.MN

Re-analysis of the Human Transcription Factor Atlas Recovers TF-Specific Signatures from Pooled Single-Cell Screens with Missing Controls

Arka Jain, Umesh Sharma

Pith reviewed 2026-05-13 20:52 UTC · model grok-4.3

classification: cs.LG, q-bio.GN, q-bio.MN

keywords: transcription factor, pooled single-cell screen, differential expression, background subtraction, perturbation atlas, gene regulation, embryoid body, MORF barcode

The pith

Background subtraction with embryoid body cells recovers TF-specific signatures for 59 of 61 testable factors in a pooled screen missing internal controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a deposited pooled single-cell overexpression screen of 3,550 transcription factor open reading frames can still support robust per-TF analysis when the original negative controls are absent from the metadata. By subtracting shared expression patterns from embryoid body cells as an external baseline, the re-analysis detects strong TF-specific differential expression in nearly every testable case, far exceeding the 27 factors identified by standard one-versus-rest comparisons. This rescue also produces effect sizes that align with independent published rankings and reveals convergent pathway changes across conditions. The work shows that principled artifact removal allows public perturbation atlases to remain usable for studying gene regulation even after experimental details are lost.

Core claim

The central claim is that background subtraction against embryoid body cells as an external baseline identifies TF-specific transcriptional signatures for 59 of 61 testable transcription factors in the human TF Atlas pooled screen, compared with only 27 detected by one-vs-rest analysis. This approach recovers significant agreement with prior effect-size rankings, highlights HOPX, MAZ, PAX6, FOS, and FEZF2 as the strongest remodelers, and links individual TFs to pathways such as differentiation, Hippo signaling, focal adhesion, and collagen biosynthesis while revealing broader convergent signatures in Wnt, neurogenic, EMT, and Hippo programs.

What carries the argument

Background subtraction of shared batch and transduction patterns observed in embryoid body cells, applied after MORF barcode demultiplexing and quality control to isolate per-TF differential expression.
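The subtraction step can be illustrated with a small sketch. The summary does not give the paper's exact rule, so the version below rests on an assumption: a gene that is differentially expressed against the EB baseline in more than a chosen fraction of TFs is treated as shared background and removed from every per-TF list. The function name, the `max_fraction` parameter, and the toy gene names are all hypothetical.

```python
from collections import Counter


def background_subtract(degs_by_tf, max_fraction=0.5):
    """Split per-TF DEG sets into TF-specific genes and shared background.

    degs_by_tf maps a TF name to the set of genes called DE versus the
    external baseline. A gene seen in more than max_fraction of TFs is
    assumed (hypothetically) to reflect batch or transduction effects.
    """
    n_tfs = len(degs_by_tf)
    # Count in how many TFs each gene appears.
    counts = Counter(g for genes in degs_by_tf.values() for g in set(genes))
    background = {g for g, c in counts.items() if c / n_tfs > max_fraction}
    specific = {tf: set(genes) - background for tf, genes in degs_by_tf.items()}
    return specific, background


# Toy input: STRESS1 is DE in every TF, so it is treated as background.
degs_by_tf = {
    "TF_A": {"G1", "G2", "STRESS1"},
    "TF_B": {"G3", "STRESS1"},
    "TF_C": {"STRESS1", "G2"},
    "TF_D": {"STRESS1"},
}
specific, background = background_subtract(degs_by_tf, max_fraction=0.5)
print(sorted(background))                          # genes removed everywhere
print({tf: sorted(g) for tf, g in specific.items()})
```

In this toy case TF_D loses its only gene, which is the failure mode the load-bearing premise below is about: a frequency rule cannot tell a shared artifact from a program that many TFs genuinely induce.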

Load-bearing premise

Embryoid body cells supply a valid external baseline whose shared patterns represent only non-specific artifacts and do not remove or confound genuine TF-specific transcriptional changes.

What would settle it

If independent targeted TF perturbation experiments in matched cell types show that the recovered signatures share no more overlap with ground-truth TF targets than the one-vs-rest signatures do, the background-subtraction recovery would be falsified.

Figures

Figures reproduced from arXiv: 2604.02511 by Arka Jain, Umesh Sharma.

**Figure 1.** UMAP of the combined TF Atlas dataset. (A) 254,519 cells colored by sample identity, showing separation between experimental conditions and overlap among pooled screen replicates. (B) Leiden clustering identifies 83 transcriptionally distinct populations.

From Section 3.2 (MORF Barcode Demultiplexing): Demultiplexing of the 8 pooled screen samples assigned 60,997 cells (79.2%) to specific TF identities (Figure 2A). The as…

**Figure 2.** MORF barcode demultiplexing of pooled screen samples. (A) UMAP of 77,018 cells colored by demultiplexing status: assigned (blue), ambiguous (orange), undetected (green). (B) Distribution of assigned cells across TFs.

From Section 3.3 (Per-TF Differential Expression): Using EB cells as a negative control with background subtraction, 59 of 61 TFs with ≥20 cells showed at least one TF-specific DEG (|log2FC| > 0.5, FDR < 0.05…
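The three demultiplexing statuses shown in Figure 2A (assigned, ambiguous, undetected) can be sketched as a per-cell decision on barcode counts. The paper's actual assignment rule is not given here, so the two thresholds below (`min_umis` and `min_ratio`) and the function name are assumptions chosen only to make the logic concrete.

```python
def classify_cell(barcode_counts, min_umis=2, min_ratio=2.0):
    """Return (status, tf) for one cell given a {tf: barcode UMI count} dict.

    Hypothetical rule: a cell is 'undetected' if its best barcode has fewer
    than min_umis UMIs, 'ambiguous' if the best barcode is less than
    min_ratio times the runner-up, and 'assigned' otherwise.
    """
    ranked = sorted(barcode_counts.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < min_umis:
        return "undetected", None
    top_tf, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    if second > 0 and top / second < min_ratio:
        return "ambiguous", None
    return "assigned", top_tf


print(classify_cell({"PAX6": 10, "FOS": 1}))  # one dominant barcode
print(classify_cell({"PAX6": 5, "FOS": 4}))   # two competing barcodes
print(classify_cell({}))                      # no barcode reads

# Assignment rate reported in the abstract: 60,997 of 77,018 cells.
print(round(100 * 60997 / 77018, 1))
```

Tightening either threshold trades assigned cells for purity, which feeds directly into how many TFs reach the ≥20-cell cutoff for testing.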

**Figure 3.** TF-specific differentially expressed genes per TF (vs EB control with background subtraction, |log2FC| > 0.5, FDR < 0.05). Red: upregulated; blue: downregulated. Top 30 TFs shown.
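The DEG definition in this caption combines an effect-size cutoff with an FDR cutoff. As a minimal sketch of how such a call is typically made, the code below applies a Benjamini-Hochberg adjustment and then both thresholds. Whether the paper used Benjamini-Hochberg specifically is an assumption; the function names and toy values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(genes, log2fc, pvalues, lfc_cut=0.5, fdr_cut=0.05):
    """Keep genes with |log2FC| > lfc_cut and adjusted p-value < fdr_cut."""
    fdr = benjamini_hochberg(pvalues)
    return [g for g, l, q in zip(genes, log2fc, fdr)
            if abs(l) > lfc_cut and q < fdr_cut]


genes = ["a", "b", "c", "d"]
log2fc = [1.2, 0.3, -0.8, -0.1]
pvalues = [0.01, 0.04, 0.03, 0.005]
print(call_degs(genes, log2fc, pvalues))
```

Gene "d" has the smallest p-value but is dropped for its small fold change, which shows why the two cutoffs are not interchangeable.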

**Figure 4.** Functional enrichment of condition-level DEGs. (A) Most frequently enriched GO/KEGG terms across perturbations, ranked by minimum adjusted p-value. (B) Dotplot showing term enrichment across perturbations (dot size = gene count, color = −log10 adjusted p-value).

From Section 3.6 (Per-TF Pathway Enrichment from Pooled Screen): To characterize the functional programs driven by individual TFs in the pooled screen, we perform…

**Figure 5.** Functional enrichment of per-TF DEGs from pooled screen. (A) Most frequently enriched GO/KEGG terms across 13 TFs with significant ORA results, ranked by minimum adjusted p-value. (B) Dotplot showing per-TF enrichment landscape (dot size = gene count, color = −log10 adjusted p-value). Only TFs with ≥5 significant DEGs (|log2FC| > 0.5, FDR < 0.05) were tested.
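Over-representation analysis (ORA), as referenced in this caption, is conventionally a one-sided hypergeometric test on the overlap between a DEG list and a gene set. The sketch below computes that tail probability in plain Python. It is not the Enrichr or GSEApy implementation cited by the paper, and the argument names are illustrative.

```python
from math import comb


def ora_pvalue(n_universe, n_term, n_degs, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_universe: genes in the background universe
    n_term:     genes annotated to the GO/KEGG term
    n_degs:     genes in the DEG list
    n_overlap:  DEGs that are annotated to the term
    """
    upper = min(n_term, n_degs)
    total = comb(n_universe, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy case: all 5 DEGs fall in a 5-gene term within a 10-gene universe.
print(ora_pvalue(10, 5, 5, 5))
```

With only 5 DEGs as the minimum input, the smallest attainable p-value is bounded by the list size, so a TF with few DEGs can fail to reach significance after multiple-testing adjustment even when its overlap is complete.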

**Figure 6.** TF-vs-pathway enrichment heatmap. Clustered heatmap of the top 20 most recurrent enriched terms across 11 TFs with significant ORA results. Color intensity represents −log10 (adjusted p-value), clipped at 10. Hierarchical clustering reveals modules of co-enriched pathways and TFs with similar functional programs.

**Figure 7.** Batch effects assessment via Harmony integration. UMAP visualizations of the 8 pooled screen replicates before and after Harmony batch correction. Minimal changes confirm that inter-replicate batch effects are negligible and do not confound downstream analyses. Harmony converged in 3 iterations.

From Section 3.8 (Validation Against Published TF Rankings): To validate our per-TF transcriptional signatures, we compared the …

**Figure 8.** Validation of per-TF signatures against Joung et al. (2023) rankings. Scatter plots comparing our per-TF DEG count with Joung et al. average rank (left) and scRNA-seq rank (right). Lower rank indicates stronger TF effect, so the negative Spearman ρ reflects positive agreement between the two studies. The significant correlation (ρ = −0.316, p = 0.013 for scRNA-seq rank) validates that our independently de…

Original abstract

Public pooled single-cell perturbation atlases are valuable resources for studying transcription factor (TF) function, but downstream re-analysis can be limited by incomplete deposited metadata and missing internal controls. Here we re-analyze the human TF Atlas dataset (GSE216481), a MORF-based pooled overexpression screen spanning 3,550 TF open reading frames and 254,519 cells, with a reproducible pipeline for quality control, MORF barcode demultiplexing, per-TF differential expression, and functional enrichment. From 77,018 cells in the pooled screen, we assign 60,997 (79.2%) to 87 TF identities. Because the deposited barcode mapping lacks the GFP and mCherry negative controls present in the original library, we use embryoid body (EB) cells as an external baseline and remove shared batch/transduction artifacts by background subtraction. This strategy recovers TF-specific signatures for 59 of 61 testable TFs, compared with 27 detected by one-vs-rest alone, showing that robust TF-level signal can be rescued despite missing intra-pool controls. HOPX, MAZ, PAX6, FOS, and FEZF2 emerge as the strongest transcriptional remodelers, while per-TF enrichment links FEZF2 to regulation of differentiation, EGR1 to Hippo and cardiac programs, FOS to focal adhesion, and NFIC to collagen biosynthesis. Condition-level analyses reveal convergent Wnt, neurogenic, EMT, and Hippo signatures, and Harmony indicates minimal confounding batch effects across pooled replicates. Our per-TF effect sizes significantly agree with Joung et al.'s published rankings (Spearman $\rho = -0.316$, $p = 0.013$; negative because lower rank indicates stronger effect). Together, these results show that the deposited TF Atlas data can support validated TF-specific transcriptional and pathway analyses when paired with principled external controls, artifact removal, and reproducible computation.
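The reported significance of the Spearman correlation can be sanity-checked with the usual t approximation for a correlation coefficient. The abstract does not state the sample size behind the test, so taking $n = 61$ (the number of testable TFs) is an assumption made here purely for illustration.

```latex
\[
t \;=\; \rho \sqrt{\frac{n-2}{1-\rho^{2}}}
  \;=\; -0.316 \sqrt{\frac{59}{1-0.316^{2}}}
  \;\approx\; -0.316 \times 8.10
  \;\approx\; -2.56
\]
```

With 59 degrees of freedom, $|t| \approx 2.56$ corresponds to a two-sided p-value between 0.01 and 0.02, which is in line with the reported $p = 0.013$. This is compatible with the correlation having been computed over roughly 61 TFs, but it does not confirm that it was.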

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Using EB cells as external baseline recovers far more TF signatures than one-vs-rest, but the assumption that it only removes artifacts needs direct checks.

The paper shows that swapping in embryoid body cells for the missing negative controls in this pooled TF overexpression atlas lets them recover differential signatures for 59 out of 61 testable TFs, versus only 27 with a standard one-vs-rest approach. They also report per-TF pathway enrichments and a modest but significant Spearman correlation with the original Joung rankings. The pipeline itself looks straightforward: standard QC, barcode demultiplexing, background subtraction, and enrichment, plus a check that Harmony sees little batch structure across replicates. That part is useful for anyone trying to reuse the same public dataset without new experiments. The soft spot is the EB baseline itself. The claim stands or falls on whether those cells carry only batch and transduction noise and nothing that overlaps real TF-driven programs; if they do, the subtraction could be removing signal or adding noise, and the higher recovery count would be partly artifactual. The abstract gives cell assignment rates and the correlation but no error bars, no multiple-testing details, and no side-by-side comparison to the original controls that were in the library. The numbers are encouraging and the work is reproducible on its face, but the central assumption is still untested in the summary. This is the kind of incremental methods note that people re-analyzing public perturbation atlases would actually read and try. It deserves a serious referee who can look at the full methods and any supplementary validation of the subtracted gene set.

Referee Report

1 major / 2 minor

Summary. The manuscript re-analyzes the human TF Atlas pooled single-cell overexpression screen (GSE216481) spanning 3,550 TF open reading frames and 254,519 cells. Because the GFP/mCherry negative controls are missing from the deposited metadata, the authors employ embryoid body (EB) cells as an external baseline for background subtraction of shared batch/transduction artifacts. Using a reproducible pipeline for QC, MORF barcode demultiplexing, per-TF differential expression, and enrichment, they assign 79.2% of 77,018 screened cells to 87 TF identities and recover TF-specific signatures for 59 of 61 testable TFs (versus 27 by one-vs-rest). Top remodelers include HOPX, MAZ, PAX6, FOS, and FEZF2; per-TF and condition-level enrichments link TFs to differentiation, Hippo, cardiac, focal adhesion, and collagen programs. Harmony shows minimal batch effects, and per-TF effect sizes correlate with Joung et al. rankings (Spearman ρ = -0.316, p = 0.013).

Significance. If the EB baseline validly isolates TF effects, the work demonstrates a practical, reproducible strategy for rescuing TF-specific signals from incomplete public perturbation atlases, substantially increasing detectable signatures and enabling pathway analyses. The agreement with an independent prior ranking and the identification of convergent Wnt/neurogenic/EMT/Hippo programs add value for the field.

major comments (1)

[Abstract and Methods] The assumption that embryoid body cells contain only batch/transduction artifacts (with no overlapping differentiation or TF-related programs) is load-bearing for the central claim of recovering 59/61 signatures via background subtraction. No direct comparison to the original negative controls, no orthogonal validation of the subtracted gene set, and no sensitivity analysis of how partial signal removal would affect the Spearman correlation are described; this leaves open the possibility that the reported improvement over one-vs-rest partly reflects signal loss rather than rescue.

minor comments (2)

[Abstract] The reported cell assignment rate (79.2%), recovery counts, and Spearman correlation lack accompanying statistical test details, error bars, multiple-testing correction information, or sample-size clarification, which would strengthen assessment of the 59/61 recovery figure.
[Abstract] The negative sign of the Spearman correlation is explained, but the exact ranking metric from Joung et al. and whether the test accounts for ties should be stated explicitly for reproducibility.
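The tie-handling question in the second minor comment matters because DEG counts are integers and will often repeat across TFs. A minimal sketch of a tie-aware Spearman correlation, which assigns average ranks to tied values and then takes the Pearson correlation of the ranks, is shown below. It is written in plain Python for illustration and is not the authors' code; the toy values are invented.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Tie-aware Spearman rho: Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: more DEGs should pair with a lower (stronger) published rank.
deg_counts = [120, 80, 80, 10]   # two TFs tie at 80 DEGs
published_rank = [1, 2, 3, 4]
print(average_ranks(deg_counts))
print(round(spearman(deg_counts, published_rank), 4))
```

The toy correlation is negative for the same reason the paper's is: a high DEG count lines up with a numerically low rank, so agreement shows up as ρ < 0.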

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive comments on our re-analysis of the TF Atlas dataset. We address the major concern point-by-point below and commit to revisions where feasible.

Referee: [Abstract and Methods] The assumption that embryoid body cells contain only batch/transduction artifacts (with no overlapping differentiation or TF-related programs) is load-bearing for the central claim of recovering 59/61 signatures via background subtraction. No direct comparison to the original negative controls, no orthogonal validation of the subtracted gene set, and no sensitivity analysis of how partial signal removal would affect the Spearman correlation are described; this leaves open the possibility that the reported improvement over one-vs-rest partly reflects signal loss rather than rescue.

Authors: We agree that the EB baseline assumption is central and that additional checks would strengthen the manuscript. The GFP/mCherry negative controls were not deposited with GSE216481, precluding direct comparison. EB cells were selected as an external baseline because they undergo comparable differentiation without TF overexpression, enabling subtraction of shared batch and transduction effects; this choice is justified by the similar culture conditions and the fact that EB cells lack the MORF barcodes used for TF assignment. To address potential signal loss, we will add a sensitivity analysis in the revised Methods and Results: we will vary the background subtraction threshold (e.g., using different quantiles of EB expression) and report the resulting changes in the number of recovered TF signatures (currently 59/61) and the Spearman correlation with Joung et al. rankings. We will also expand the Methods justification for EB with references to EB composition in the literature. While we lack independent orthogonal data for the subtracted gene set, the recovered signatures show biological coherence (e.g., FEZF2 linked to differentiation, FOS to focal adhesion) and the per-TF effect sizes correlate significantly with an independent ranking (ρ = -0.316, p = 0.013), supporting that the improvement over one-vs-rest reflects artifact removal rather than loss of true signal.

revision: partial
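The sensitivity analysis promised in this response could take roughly the following shape. The rebuttal only says "different quantiles of EB expression", so the specific rule used here is an assumption: a gene counts toward a TF's signature when the TF's mean expression exceeds the chosen quantile of EB expression for that gene. All names and toy numbers are hypothetical.

```python
def quantile(values, q):
    """Nearest-rank style quantile for 0 <= q <= 1 on a non-empty list."""
    s = sorted(values)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]


def recovered_tfs(tf_means, eb_values, q):
    """TFs with at least one gene whose mean exceeds the EB q-quantile."""
    cutoffs = {g: quantile(v, q) for g, v in eb_values.items()}
    return sorted(
        tf for tf, means in tf_means.items()
        if any(m > cutoffs[g] for g, m in means.items())
    )


def sweep(tf_means, eb_values, quantiles=(0.25, 0.5, 0.9)):
    """Number of recovered TFs at each background quantile."""
    return {q: len(recovered_tfs(tf_means, eb_values, q)) for q in quantiles}


# Toy EB expression per gene (one value per EB cell) and per-TF means.
eb_values = {
    "G1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "G2": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
}
tf_means = {
    "TF_A": {"G1": 9.5, "G2": 0.5},
    "TF_B": {"G1": 6.0, "G2": 1.5},
    "TF_C": {"G1": 3.0, "G2": 0.0},
}
print(sweep(tf_means, eb_values))
```

A recovery count that stays near 59/61 across a wide range of quantiles would support the artifact-removal reading; a count that collapses as the quantile rises would suggest the result depends heavily on where the background line is drawn.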

standing simulated objections not resolved

Direct comparison to the original GFP/mCherry negative controls cannot be performed because they are absent from the deposited dataset.

Circularity Check

0 steps flagged

No circularity: external EB baseline and independent validation keep derivation self-contained

The paper subtracts shared patterns using embryoid body cells that are explicitly external to the pooled screen (not part of GSE216481) and reports Spearman agreement with Joung et al. rankings from a prior independent study. No equations, fitted parameters, or self-citations reduce the recovered TF signatures (59/61) or effect sizes to quantities defined or fitted from the same screen data. The one-vs-rest comparison and Harmony batch check are standard downstream steps that do not create self-referential loops. This is the normal case of an analysis anchored on external controls and external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that embryoid body cells constitute a clean negative baseline and that background subtraction removes only non-TF artifacts. No new entities are introduced, and the only free parameters are the assignment and detection thresholds listed below.

free parameters (1)

TF assignment and detection thresholds
Implicit thresholds used to assign 79.2% of cells to 87 TF identities and to declare 59/61 signatures as recovered.

axioms (1)

domain assumption: Embryoid body cells serve as a suitable external negative control whose shared expression patterns represent only batch and transduction artifacts.
Invoked to justify background subtraction that isolates TF-specific effects.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

we use embryoid body (EB) cells as an external baseline and remove shared batch/transduction artifacts by background subtraction. This strategy recovers TF-specific signatures for 59 of 61 testable TFs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] Samuel A Lambert, Arttu Jolma, Laura F Campitelli, Prem K Das, Yimeng Yin, Mihai Albu, Xiaoting Chen, Jussi Taipale, Timothy R Hughes, and Matthew T Weirauch. The human transcription factors. Cell, 172(4):650-665, 2018.
[2] Juan M Vaquerizas, Sarah K Kummerfeld, Sarah A Teichmann, and Nicholas M Luscombe. A census of human transcription factors: function, expression and evolution. Nature Reviews Genetics, 10(4):252-263, 2009.
[3] Atray Dixit, Oren Parnas, Biyu Li, Jenny Chen, Charles P. Fulco, Livnat Jerby-Arnon, Nemanja D. Marjanovic, Danielle Dionne, Tyler Burks, Raktima Raychowdhury, Britt Adamson, Thomas M. Norman, Eric S. Lander, Jonathan S. Weissman, Nir Friedman, and Aviv Regev. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled gen... (2016)
[4] Joseph M Replogle, Reuben A Saunders, Angela N Pogson, Jeffrey A Hussmann, Alexander Lenail, Alina Guna, Lauren Mascibroda, Eric J Wagner, Karen Adelman, Jennifer A Doudna, et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell, 185(14):2559-2575, 2022.
[5] Julia Joung, Sai Ma, Tristan Tay, Kathryn R Geiger-Schuller, Paul C Kirchgatterer, Vanessa K Verdine, Baolin Guo, Mario Arias-Garcia, William E Allen, Isha Singh, et al. A transcription factor atlas of directed differentiation. Cell, 186(1):209-229, 2023.
[6] Malte D Luecken and Fabian J Theis. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology, 15(6):e8746, 2019.
[7] F Alexander Wolf, Philipp Angerer, and Fabian J Theis. Scanpy: large-scale single-cell gene expression data analysis. Genome Biology, 19(1):1-5, 2018.
[8] Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177(7):1888-1902, 2019.
[9] Maxim V Kuleshov, Matthew R Jones, Andrew D Rouillard, Nicolas F Fernandez, Qiaonan Duan, Zichen Wang, Simon Koplev, Sherry L Jenkins, Kathleen M Jagodnik, Alexander Lachmann, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research, 44(W1):W90-W97, 2016.
[10] Zhuoqing Fang, Xinyuan Liu, and Gary Peltz. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics, 39(1):btac757, 2023.
[11] Bin Chen, Laura R Schaevitz, and Susan K McConnell. Fezl regulates the differentiation and axon targeting of layer 5 subcortical projection neurons in cerebral cortex. Proceedings of the National Academy of Sciences, 102(47):17184-17189, 2005.
[12] Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12):1289-1296, 2019.
[13] Efthymia Papalexi, Eleni P Mimitou, Andrew W Butler, Samantha Foster, Bernadette Bracken, William M Mauck, Hans-Hermann Wessels, Yuhan Hao, Bonnie V Yeung, Peter Smibert, et al. Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens. Nature Genetics, 53(3):322-331, 2021.
[14] Mingze Dong, Bao Wang, Jessica Wei, Antonio H. de O. Fonseca, Curtis J. Perry, Alexander Frey, Feriel Ouerghi, Ellen F. Foxman, Jeffrey J. Ishizuka, Rahul M. Dhodapkar, and David van Dijk. Causal identification of single-cell experimental perturbation effects with CINEMA-OT. Nature Methods, 20(11):1769-1779, 2023.