The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models
Pith reviewed 2026-06-27 22:23 UTC · model grok-4.3
The pith
A residualization-and-permutation diagnostic separates sequence predictability from regulatory signal in three genomic foundation models applied to glioma loci.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The residualization-and-permutation diagnostic cleanly separates a sequence-predictability layer from a regulatory-output layer with literally zero overlap between the two top-100 lists across three models; a sharp 10kb proximal-regulatory horizon survives every control, and top-100 elements are 3.3× enriched for matching brain eQTLs.
What carries the argument
The residualization-and-permutation diagnostic, which subtracts predictability-driven variance from ISM scores and applies permutation tests to isolate regulation-driven signal in element rankings.
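A minimal sketch of the residualization idea, assuming an ordinary least-squares fit of per-element ISM scores on a predictability proxy, as the simulated rebuttal further down describes it. The function name `residualize`, the two-column proxy, and the synthetic data are illustrative assumptions, not the paper's code or data.

```python
import numpy as np

def residualize(ism_scores, predictability):
    """Return OLS residuals of ISM scores regressed on predictability covariates.

    ism_scores: shape (n,). predictability: shape (n,) or (n, k).
    """
    y = np.asarray(ism_scores, dtype=float)
    X = np.asarray(predictability, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Intercept column, so the residuals have zero mean.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Residual = observed ISM minus fitted value.
    return y - design @ coef

# Synthetic illustration only (hypothetical values, not from the paper).
rng = np.random.default_rng(0)
n = 500
proxy = rng.normal(size=(n, 2))   # stand-ins for entropy and log-likelihood
regulatory = rng.normal(size=n)   # signal independent of the proxy
ism = 2.0 * proxy[:, 0] - 1.0 * proxy[:, 1] + 0.5 * regulatory
residual = residualize(ism, proxy)
```

In this toy setup the residuals are uncorrelated with the proxy by construction and track the independent component closely. If the regulatory component were itself correlated with the proxy, part of it would be removed as well, which is the concern raised under the load-bearing premise.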
If this is right
- A six-feature linear baseline matches Caduceus top-decile membership at AUC=0.985, showing that LM-derived element hierarchies may not exceed simple sequence features.
- The LM-derived element-class hierarchy does not survive the decomposition into separate layers.
- Conservation, brain cis-eQTL, and STRING-PPI cross-checks anchor the biology that remains after controls.
- A transposable-element regulatory layer and NRXN1+NLGN1 protein-pair convergence both fail the permutation tests once properly constructed.
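The baseline comparison in the first bullet can be illustrated with a short sketch: fit a linear model on six features to predict top-decile membership, then score it by AUC. The six features are not listed in the source, so the feature matrix, the weights, and the stand-in `model_score` below are purely hypothetical, and the fit is evaluated in-sample for brevity.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Synthetic illustration only (hypothetical features and scores).
rng = np.random.default_rng(1)
n = 2000
features = rng.normal(size=(n, 6))  # six hypothetical sequence features
weights = np.array([1.5, -1.0, 0.8, 0.0, 0.5, -0.3])
model_score = features @ weights + 0.3 * rng.normal(size=n)
top_decile = model_score >= np.quantile(model_score, 0.9)

# Linear baseline: least-squares fit of the 0/1 membership label.
design = np.column_stack([np.ones(n), features])
coef, *_ = np.linalg.lstsq(design, top_decile.astype(float), rcond=None)
baseline_auc = pairwise_auc(design @ coef, top_decile)
```

A held-out split would be the safer choice for a real comparison; whether the paper's AUC of 0.985 is in-sample or held-out is not stated in the abstract.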
Where Pith is reading between the lines
- The diagnostic could be applied to ISM studies in other cell types or diseases to test whether claimed regulatory signals are independent of predictability.
- The consistent 10kb horizon implies that any long-range regulatory effects captured by these models would require additional controls beyond the current method.
- If the zero-overlap separation generalizes, future ISM work could routinely report both layers rather than a single combined ranking.
- The method's ability to retain residual cCRE signal only in Enformer suggests architecture-specific differences in what counts as regulatory versus predictive.
Load-bearing premise
The residualization step removes predictability-driven variance without distorting or removing genuine regulation-driven signal, and the permutation tests fully control for confounders in the element rankings and enrichment analyses.
What would settle it
Finding substantial overlap between the predictability-layer and regulatory-layer top-100 lists after applying the residualization-and-permutation procedure, or seeing the brain eQTL enrichment vanish under stricter permutation controls.
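The overlap check itself is easy to state in code. The sketch below counts shared elements between two top-k lists and, as a reference point, computes the chance of zero overlap if the two lists were independent uniform random draws from the same pool. That null is an assumption made here for illustration and need not describe how the actual rankings behave. The function names are hypothetical.

```python
import numpy as np

def top_k_overlap(scores_a, scores_b, k=100):
    """Count elements shared by the top-k of two score vectors (higher is better)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    top_a = set(np.argsort(-a, kind="stable")[:k].tolist())
    top_b = set(np.argsort(-b, kind="stable")[:k].tolist())
    return len(top_a & top_b)

def prob_zero_overlap(n, k):
    """P(no shared element) for two independent random size-k subsets of n items."""
    i = np.arange(k)
    return float(np.prod((n - k - i) / (n - i)))

# Reference values for the element count and list size quoted in the abstract.
n_elements, k = 30_448, 100
expected_overlap = k * k / n_elements
p_zero = prob_zero_overlap(n_elements, k)
```

With 30,448 elements and two top-100 lists, the expected overlap under this independence null is about 0.33 elements, and zero overlap occurs with probability of roughly 0.72. Under that assumption, an empty intersection alone would not distinguish separated layers from unrelated rankings, so the degree of overlap at larger k may be the more informative quantity.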
Original abstract
High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a residualization-and-permutation diagnostic applied to ISM scores from three genomic foundation models (Caduceus-Ph, HyenaDNA, Enformer) across 30,448 dark genome elements at glioma-relevant loci. It claims that the diagnostic cleanly separates a sequence-predictability layer (dominated by long, well-predicted transposable elements in the LMs) from a regulatory-output layer (retained only in Enformer residuals), with literally zero overlap between the two top-100 lists. It further reports a sharp 10 kb proximal-regulatory horizon that survives all controls, a six-feature linear baseline matching Caduceus top-decile membership at AUC 0.985, and 3.3× enrichment for brain eQTLs in the top-100 elements (p_emp < 5×10^{-3}). The diagnostic is offered as a general tool for ISM-based regulatory studies.
Significance. If the separation holds after proper validation, the work supplies a concrete methodological contribution for interpreting zero-shot ISM outputs from sequence models, showing that LM rankings are largely predictability-driven while highlighting residual regulatory signal in Enformer and providing empirical anchors via eQTL and conservation cross-checks.
major comments (3)
- [Abstract] The residualization step is described only at the level of 'residualization of ISM scores against a predictability measure' with no equation, regression specification, definition of the subtracted component, or cross-validation (e.g., recovery of known cCREs in the residuals), so it is impossible to assess whether the operation removes only predictability variance without distorting or removing genuine regulation-driven signal or introducing spurious orthogonality.
- [Abstract] The permutation tests asserted to control confounders for the 3.3× eQTL enrichment (p_emp < 5×10^{-3}) and the failure of the transposable-element and NRXN1+NLGN1 claims are not described (which elements are permuted, which covariates matched), leaving the reported empirical p-values sensitive to the precise null construction.
- [Abstract] The claim of literally zero overlap between predictability-layer and regulatory-layer top-100 lists across all three models rests on the unvalidated residualization; without an explicit procedure or sensitivity analysis, this central separation result cannot be evaluated for robustness.
minor comments (1)
- [Abstract] The six-feature linear baseline is mentioned but the features themselves are not listed.
Simulated Author's Rebuttal
We thank the referee for the constructive critique of the abstract. The comments correctly identify that the abstract is too terse on technical specifics. We will revise the abstract to incorporate the requested details while preserving its length constraints, and we address each point below.
- Referee: [Abstract] The residualization step is described only at the level of 'residualization of ISM scores against a predictability measure' with no equation, regression specification, definition of the subtracted component, or cross-validation (e.g., recovery of known cCREs in the residuals), so it is impossible to assess whether the operation removes only predictability variance without distorting or removing genuine regulation-driven signal or introducing spurious orthogonality.
  Authors: We agree the abstract lacks the explicit regression equation. The residualization is a linear regression of per-element ISM scores on a predictability proxy (local sequence entropy plus model log-likelihood), with residuals defined as observed ISM minus fitted value; the subtracted component is therefore the predictability-driven variance. We will add this specification and the equation to the revised abstract. On distortion: the Enformer residuals alone retain statistically significant cCRE enrichment (reported in Results), which would be absent if regulatory signal had been removed; this serves as the internal cross-check. A sensitivity analysis varying the predictability proxy will be added to the supplement.
  Revision: yes
- Referee: [Abstract] The permutation tests asserted to control confounders for the 3.3× eQTL enrichment (p_emp < 5×10^{-3}) and the failure of the transposable-element and NRXN1+NLGN1 claims are not described (which elements are permuted, which covariates matched), leaving the reported empirical p-values sensitive to the precise null construction.
  Authors: The abstract is indeed silent on the null. The permutation procedure (detailed in Methods) stratifies elements by length, GC content, and distance to nearest TSS, then randomly reassigns labels within strata 10,000 times while preserving the covariate distribution; the empirical p-value is the fraction of permuted enrichments exceeding the observed value. We will insert a one-sentence description of this stratified permutation into the revised abstract. The same construction is used for the transposable-element and NRXN1+NLGN1 tests, both of which lose significance under the matched null.
  Revision: yes
- Referee: [Abstract] The claim of literally zero overlap between predictability-layer and regulatory-layer top-100 lists across all three models rests on the unvalidated residualization; without an explicit procedure or sensitivity analysis, this central separation result cannot be evaluated for robustness.
  Authors: The zero overlap is a direct numerical consequence of ranking on raw ISM versus residuals; any element in the top-100 raw-ISM list necessarily has low residual rank by construction. We will qualify the claim in the abstract by referencing the cross-model consistency (Caduceus and HyenaDNA co-rank the same long TEs on raw scores; Enformer residuals alone recover cCREs) and will add a brief sensitivity note showing that the overlap remains zero under alternative predictability proxies. The full robustness checks appear in the Results section.
  Revision: yes
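The stratified permutation described in the second response can be sketched as follows. This is an illustration under stated assumptions, not the authors' Methods: strata are taken to be precomputed integer labels (for example, joint bins of length, GC content, and TSS distance), the test statistic is the count of hits among top elements, and the `>=` comparison and add-one correction are choices made here. All names and the synthetic data are hypothetical.

```python
import numpy as np

def stratified_permutation_pvalue(is_top, is_hit, strata, n_perm=10_000, seed=0):
    """Empirical p-value for enrichment of hits among top-ranked elements.

    The is_top labels are shuffled within each stratum, so every permuted
    top set keeps the observed number of top elements per stratum.
    """
    rng = np.random.default_rng(seed)
    is_top = np.asarray(is_top, dtype=bool)
    is_hit = np.asarray(is_hit, dtype=bool)
    strata = np.asarray(strata)
    observed = int(np.sum(is_top & is_hit))
    groups = [np.flatnonzero(strata == s) for s in np.unique(strata)]
    exceed = 0
    for _ in range(n_perm):
        permuted = np.empty_like(is_top)
        for idx in groups:
            permuted[idx] = rng.permutation(is_top[idx])
        if np.sum(permuted & is_hit) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_perm + 1)

# Synthetic illustration only (hypothetical labels, not from the paper).
rng = np.random.default_rng(2)
n = 1000
strata = rng.integers(0, 4, size=n)   # hypothetical covariate bins
is_top = np.zeros(n, dtype=bool)
is_top[:100] = True                   # stand-in for a top-100 list
is_hit = rng.random(n) < np.where(is_top, 0.5, 0.05)  # enriched by design
p_demo = stratified_permutation_pvalue(is_top, is_hit, strata, n_perm=2000)
```

Because the number of top elements and the total number of hits are fixed under this shuffle, the hit count orders permutations the same way a fold-enrichment ratio would. How finely the continuous covariates are binned is a free choice that can change the null, which is the sensitivity the referee points to.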
Circularity Check
No significant circularity; diagnostic presented as independent methodological contribution
The paper's central contribution is the introduction of a residualization-and-permutation diagnostic applied to ISM scores from three distinct foundation models. The provided abstract describes the procedure as separating predictability-driven from regulation-driven variance, reports zero overlap in top-100 lists, a 10kb horizon, and eQTL enrichments under permutation controls, without any equations, self-citations, or derivations that reduce these outputs to the inputs by construction. No load-bearing step matches the enumerated circularity patterns; the method is framed as a general tool whose validity is asserted via cross-model consistency and external anchors rather than tautological redefinition or fitted renaming. The derivation chain remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: in-silico mutagenesis scores from sequence foundation models can be meaningfully residualized to isolate regulation-driven variance from predictability-driven variance.