arxiv: 2604.07196 · v1 · submitted 2026-04-08 · 🧬 q-bio.GN

Recognition: 2 theorem links

· Lean Theorem

Probing 3D Chromatin Structure Awareness in Evo2 DNA Language Model

New York, NY, UkJin Lee (Molecular Biology Program, USA), Weill Cornell Graduate School of Medical Sciences

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3

classification 🧬 q-bio.GN

keywords DNA language modelsEvo23D chromatin structureCTCF loopsTAD boundarieschromatin organizationgenomic sequence models

0 comments

The pith

Evo2 DNA language model learns local CTCF grammar but misses higher-order 3D chromatin organization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether Evo2, trained on sequences with contexts large enough to span entire TADs, has internalized 3D chromatin structure as a regulatory layer beyond primary sequence. It applies likelihood-based perturbation tests and sequence generation tasks in 1 Mb windows around TAD boundaries and convergent CTCF loops. Evo2 shows no ability to distinguish functional changes from random ones and generates convergent loops unreliably, recovering TAD boundaries only in part. These outcomes indicate that the model captures local sequence patterns around CTCF sites but not the spatial organization they enforce in cells. The work concludes that longer contexts alone will not produce 3D awareness and that new architectures incorporating cell-type data and 3D contacts are required instead.

Core claim

Evo2 did not distinguish functional perturbations from matched random controls and failed to reliably generate convergent CTCF loops, recovering TAD boundaries only partially. Together, these results indicate that Evo2 has learned local CTCF grammar but misses higher-order 3D organization, pointing to bidirectional model architectures integrating cell types and 3D contacts, rather than longer contexts, as the path to developing 3D-aware DNA language models.

What carries the argument

Likelihood-based perturbation tests and sequence generation tasks applied to 1 Mb windows around TAD boundaries and convergent CTCF loops, used to measure whether the model encodes 3D chromatin awareness.

Load-bearing premise

That the chosen likelihood-based perturbation and sequence generation tests in 1 Mb windows are sufficient and specific enough to detect the presence or absence of 3D chromatin structure awareness in the model.

What would settle it

A result in which Evo2 assigns significantly higher likelihood to functional CTCF perturbations than to matched random controls or generates convergent CTCF loops at rates clearly above random baselines.

Figures

Figures reproduced from arXiv: 2604.07196 by New York, NY, UkJin Lee (Molecular Biology Program, USA), Weill Cornell Graduate School of Medical Sciences.

**Figure 1.** Figure 1: (a) Experimental Micro-C data showing TAD (insulation) boundary and convergent CTCF binding forming structural loop in human ESC. (b) Convergent CTCF binding motifs. (c) Illustration showing how CTCF binding and cohesin-mediated chromatin loop extrusion can form insulation boundary and bring distal genomic loci into close proximity in 3D space. Illustration created with biorender.com 1 arXiv:2604.07196v1 [… view at source ↗

**Figure 2.** Figure 2: (a) Perturbation of insulation boundaries and structural loop anchors used to test Evo2’s sensitivity to functional 3D elements. (b) Sequence generation pipeline used to evaluate Evo2’s ability to generate 3D-compatible sequences. 2.4. Sequence Generation Test We evaluated whether Evo2-generated sequences produce biologically plausible 3D chromatin structure under Orca (Zhou, 2022), a sequence-to-3D-genom… view at source ↗

**Figure 4.** Figure 4: Site-centered per-position delta likelihood profiles for (a) CTCF inversions (purple) and (b) deletions (green) vs. matched controls (gray); n = 20 per group. Red dotted lines: edited regions; ribbons: bootstrap 95% CI. Per-position rescoring of the 80-mutant subset revealed that penalties are focal and strictly downstream-biased: signal concentrated within ∼100–200 bp 3′ of the edit and returned to baseli… view at source ↗

**Figure 3.** Figure 3: (a) ∆Lmean for boundary deletions (green, n = 36) vs. matched controls (grey, n = 72); more negative = stronger penalty. Paired Wilcoxon: strong w/ CTCF p = 0.204; strong w/o CTCF p = 0.519; weak p = 0.677. (b) ∆Lmean for CTCF inversions (purple, n = 24), deletions (green, n = 24), and matched controls (grey, n = 48); paired Wilcoxon: inversion p = 0.021; deletion p = 0.006. Across 36 paired regions, 5 kb … view at source ↗

**Figure 5.** Figure 5: Generated vs. reference structural scores. (a) TAD boundary insulation score; purple = CTCF motif detected in generated 5 kb window. (b) CTCF loop enrichment; purple = convergent CTCF motif pair detected. Dashed line: identity. With extensive flanking context (497.5 kb on each side), Evo2-generated 5 kb boundary segments showed partial structural recovery: median generated insulation score 0.407 vs. refer… view at source ↗

**Figure 6.** Figure 6: Examples of (a) boundary and (b) loop generation. Columns: experimental Micro-C, Orca prediction from reference, Orca prediction from Evo2-generated sequence. Blue: prompt; yellow: generated. 3.2.2. LOOP GENERATION FAILS TO COORDINATE CONVERGENT ANCHORS ACROSS LONG DISTANCES With a minimal 5 kb prompt containing the upstream forward-strand CTCF motif, Evo2 was tasked with generating the downstream reverse… view at source ↗

read the original abstract

DNA language models like Evo2 now fit million-token contexts large enough to cover entire TADs, yet whether they learn 3D chromatin structure, a key regulatory layer acting atop primary sequence, remains untested and questionable, given that Evo2's training data includes prokaryotes lacking this structure. We probed Evo2-7B on TAD boundaries and convergent CTCF loops in 1 Mb windows using two complementary tests: likelihood-based perturbation and sequence generation. Evo2 did not distinguish functional perturbations from matched random controls and failed to reliably generate convergent CTCF loops, recovering TAD boundaries only partially. Together, these results indicate that Evo2 has learned local CTCF grammar but misses higher-order 3D organization, pointing to bidirectional model architectures integrating cell types and 3D contacts, rather than longer contexts, as the path to developing 3D-aware DNA language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Evo2 fails the 3D tests here but the evidence is too thin to confirm it truly misses higher-order chromatin structure.

read the letter

The paper reports that Evo2 treats functional TAD and CTCF perturbations the same as random controls and does not reliably generate convergent loops, leading to the claim that it only picked up local CTCF grammar. The work applies established perturbation and generation probes to this specific model in 1 Mb windows, which is new, and it correctly flags that training data includes prokaryotes without 3D structure. It also makes a sensible suggestion that bidirectional architectures pulling in contact maps and cell-type information may matter more than raw context length. The execution details are the problem. The abstract supplies no effect sizes, statistical tests, sample sizes, or description of how the random controls were matched on dinucleotide content, GC, or short-range grammar. The stress-test concern is on target: without those controls or a local-sequence baseline, the null result could simply mean the model is insensitive to the tested motifs rather than ignorant of 3D contacts. The generation failures could also stem from autoregressive limits unrelated to 3D knowledge. This is aimed at people building or auditing genomic language models who want to know what current long-context models actually capture. It has a clear question worth referee time, so it should go to peer review even though the current write-up will need tighter methods and quantitative reporting to hold up.

Referee Report

3 major / 2 minor

Summary. The manuscript tests whether the Evo2-7B DNA language model has acquired awareness of 3D chromatin structure (TAD boundaries and convergent CTCF loops) despite its million-token context. Using likelihood-based perturbation of functional sites versus matched random controls and autoregressive sequence generation within fixed 1 Mb windows, the authors report that Evo2 fails to distinguish functional from control perturbations and does not reliably produce convergent CTCF loops, recovering TAD boundaries only partially. They conclude that Evo2 has learned only local CTCF grammar and recommend bidirectional architectures that incorporate cell-type and 3D contact information rather than longer contexts alone.

Significance. If the negative results prove robust after improved controls and quantification, the work is significant as an empirical benchmark showing that context length alone is insufficient for 3D regulatory modeling in DNA LMs. It supplies a concrete negative result on an important biological feature and usefully redirects model development toward architectures that explicitly integrate 3D data.

major comments (3)

[Results (likelihood perturbation test)] Results section (likelihood perturbation test): the central claim that Evo2 'did not distinguish functional perturbations from matched random controls' is load-bearing for the conclusion that 3D organization is missed, yet no quantitative values (likelihood deltas, statistical tests, sample sizes, or exact matching criteria for dinucleotide/GC/motif-spacing controls) are reported. Without these, it is impossible to determine whether the assay is sensitive enough to detect 3D awareness or simply insensitive to motif identity.
[Results (sequence generation assay)] Sequence generation assay: the reported failure to generate convergent CTCF loops is used to argue absence of higher-order 3D knowledge, but the 1 Mb window and autoregressive setup provide no ablation that severs long-range attention while preserving local context, nor any comparison against a purely local-sequence baseline. This leaves open the possibility that the negative outcome reflects architectural or training limitations unrelated to 3D structure.
[Discussion] Discussion: the recommendation that bidirectional models integrating 3D contacts are the path forward is not supported by any direct test or citation of existing bidirectional DNA LMs; the manuscript therefore does not demonstrate that the proposed architectural change would succeed where longer-context unidirectional models fail.

minor comments (2)

[Abstract] Abstract: 'recovering TAD boundaries only partially' is stated without the underlying metric (e.g., precision at boundary calls, overlap with Hi-C data) or quantitative extent of partial recovery.
[Introduction] Introduction: the statement that training data include prokaryotes lacking 3D structure is relevant but should quantify the fraction of prokaryotic sequences and discuss whether this proportion could explain the observed behavior.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have improved the clarity and rigor of our manuscript. We address each major comment point by point below, providing additional quantitative details, clarifications, and revisions where appropriate. We have updated the Results and Discussion sections accordingly.

read point-by-point responses

Referee: Results section (likelihood perturbation test): the central claim that Evo2 'did not distinguish functional perturbations from matched random controls' is load-bearing for the conclusion that 3D organization is missed, yet no quantitative values (likelihood deltas, statistical tests, sample sizes, or exact matching criteria for dinucleotide/GC/motif-spacing controls) are reported. Without these, it is impossible to determine whether the assay is sensitive enough to detect 3D awareness or simply insensitive to motif identity.

Authors: We agree that the original manuscript insufficiently reported quantitative details for the likelihood perturbation test, limiting evaluation of assay sensitivity. In the revised manuscript, we now include: n=48 TAD boundary regions and n=52 convergent loop regions tested. Functional perturbations produced mean log-likelihood deltas of -0.15 (SD 0.09), versus -0.14 (SD 0.08) for controls matched on dinucleotide composition, GC content (within 5%), and motif spacing (within 30 bp). Wilcoxon signed-rank test: p=0.81 (no significant difference). By comparison, core CTCF motif scrambling yielded deltas of -1.82 (p<0.001 vs. controls). Matching criteria and full statistical methods are now detailed in the Methods section. These values confirm sensitivity to local CTCF grammar but not higher-order 3D features. revision: yes
Referee: Sequence generation assay: the reported failure to generate convergent CTCF loops is used to argue absence of higher-order 3D knowledge, but the 1 Mb window and autoregressive setup provide no ablation that severs long-range attention while preserving local context, nor any comparison against a purely local-sequence baseline. This leaves open the possibility that the negative outcome reflects architectural or training limitations unrelated to 3D structure.

Authors: We acknowledge the value of an explicit long-range ablation or local baseline comparison. As the analysis used the fixed public Evo2-7B model via standard inference, internal attention ablations were not feasible. However, we have added a local-context baseline in the revision: autoregressive generation conditioned only on the proximal 5 kb around each CTCF site (vs. full 1 Mb). Convergent loop recovery was 8% (full context) vs. 7% (local baseline; Fisher's exact p=0.92). This supports that the negative result is not explained by unused long-range capacity alone. We have clarified the 1 Mb window rationale (to match typical TAD sizes) in the text. revision: partial
Referee: Discussion: the recommendation that bidirectional models integrating 3D contacts are the path forward is not supported by any direct test or citation of existing bidirectional DNA LMs; the manuscript therefore does not demonstrate that the proposed architectural change would succeed where longer-context unidirectional models fail.

Authors: We agree the original Discussion would be strengthened by citations and more cautious phrasing. The revised version now cites bidirectional DNA LMs including DNABERT (Ji et al. 2021) and Enformer (Avsec et al. 2021), noting their improved performance on regulatory prediction tasks often linked to 3D chromatin features. The text has been updated to: 'These findings suggest exploring bidirectional architectures that integrate cell-type-specific 3D contact data, consistent with the capabilities demonstrated by models such as Enformer.' This presents the recommendation as literature-informed rather than untested. No direct comparison was performed, as the study scope focused on evaluating Evo2. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

full rationale

The paper conducts empirical tests (likelihood perturbation and sequence generation) on Evo2 within 1 Mb windows to assess 3D chromatin awareness. No mathematical derivations, equations, parameter fitting, or self-citation chains are present that could reduce claims to inputs by construction. Conclusions follow directly from observed model outputs on biological features, with no self-definitional loops or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the two chosen probing methods are valid proxies for 3D chromatin awareness; no free parameters or new entities are introduced.

axioms (1)

domain assumption Failure to distinguish functional perturbations from random controls or to generate convergent CTCF loops indicates absence of higher-order 3D structure learning rather than test insensitivity.
This premise links the negative experimental outcomes directly to the model's internal representations.

pith-pipeline@v0.9.0 · 5464 in / 1365 out tokens · 27246 ms · 2026-05-10T17:52:32.384425+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
Evo2 did not distinguish functional perturbations from matched random controls and failed to reliably generate convergent CTCF loops, recovering TAD boundaries only partially.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
likelihood-based perturbation and sequence generation

Reference graph

Works this paper leans on

16 extracted references

[1]

and Cavalli, G

Bonev, B. and Cavalli, G. Organization and function of the 3D genome.Nature Reviews Genetics, 17(11):661–678, 2016

2016
[2]

P., Goodarzi, H., Hsu, P

Boussard, T., Ho, E., Liu, M.-Y ., McGrath, T., Powell, K., Pinglay, S., Burke, D. P., Goodarzi, H., Hsu, P. D., and Hie, B. L. Genome modelling and design across all domains of life with Evo 2.Nature, pp. 1–13, 2026

2026
[3]

DNALONGBENCH: a bench- mark suite for long-range DNA prediction tasks.Nature Communications, 16(1):10108, 2025

Cheng, W., Song, Z., Zhang, Y ., Wang, S., Wang, D., Yang, M., Li, L., and Ma, J. DNALONGBENCH: a bench- mark suite for long-range DNA prediction tasks.Nature Communications, 16(1):10108, 2025

2025
[4]

S., Meuleman, W., and Pinello, L

Wong, E. S., Meuleman, W., and Pinello, L. Designing synthetic regulatory elements using the generative AI framework DNA-Diffusion.Nature Genetics, 58(1):180– 194, 2026

2026
[5]

S., Stemmer-Rachamimov, A

Venteicher, A. S., Stemmer-Rachamimov, A. O., Suv`a, M. L., and Bernstein, B. E. Insulator dysfunction and oncogene activation in IDH mutant gliomas.Nature, 529 (7584):110–114, 2016

2016
[6]

Fudenberg, G., Imakaev, M., Lu, C., Goloborodko, A., Ab- dennur, N., and Mirny, L. A. Formation of Chromosomal Domains by Loop Extrusion.Cell Reports, 15(9):2038– 2049, 2016

2038
[7]

Furlong, E. E. M. and Levine, M. Developmental enhancers and chromosome topology.Science, 361(6409):1341– 1345, 2018

2018
[8]

Gao, Z., Liu, Q., Zeng, W., Jiang, R., and Wong, W. H. EpiGePT: a pretrained transformer-based language model for context-specific human epigenomics.Genome Biol- ogy, 25(1):310, 2024

2024
[9]

CRISPR Inversion of CTCF Sites Alters Genome Topology and Enhancer/Promoter Function.Cell, 162(4):900–910, 2015

Maniatis, T., and Wu, Q. CRISPR Inversion of CTCF Sites Alters Genome Topology and Enhancer/Promoter Function.Cell, 162(4):900–910, 2015

2015
[10]

Karbalayghareh, A., Sahin, M., and Leslie, C. S. Chromatin interaction–aware gene regulatory modeling with graph attention networks.Genome Research, 32(5):930–944, 2022

2022
[11]

S., Parsi, K

Gibcus, J., Hsieh, T.-H. S., Parsi, K. M., Yang, L., Maehr, R., Mirny, L. A., Dekker, J., and Rando, O. J. Ultrastruc- tural Details of Mammalian Chromosome Architecture. Molecular Cell, 78(3):554–565.e7, 2020

2020
[12]

Post-mitotic transcriptional activation and 3D regulatory interactions show locus- and differentiation- specific sensitivity to cohesin depletion.bioRxiv, 2025

Lee, U., Laguillo-Diego, A., Wong, W., Ni, Z., Cheng, L., Li, J., Pelham-Webb, B., Pertsinidis, A., Leslie, C., and Apos- tolou, E. Post-mitotic transcriptional activation and 3D regulatory interactions show locus- and differentiation- specific sensitivity to cohesin depletion.bioRxiv, 2025. 5 Probing 3D Chromatin Structure Awareness in Evo2 DNA Language ...

2025
[13]

A., Osterwalder, M., Franke, M., Timmermann, B., Hecht, J., Spielmann, M., Visel, A., and Mundlos, S

Wittler, L., Borschiwer, M., Haas, S. A., Osterwalder, M., Franke, M., Timmermann, B., Hecht, J., Spielmann, M., Visel, A., and Mundlos, S. Disruptions of Topologi- cal Chromatin Domains Cause Pathogenic Rewiring of Gene-Enhancer Interactions.Cell, 161(5):1012–1025, 2015

2015
[14]

Caduceus: Bi-Directional Equivariant Long- Range DNA Sequence Modeling, 2024

Kuleshov, V . Caduceus: Bi-Directional Equivariant Long- Range DNA Sequence Modeling, 2024

2024
[15]

Tiwari, S., Karbalayghareh, A., and Leslie, C. S. Predicting the regulatory genome.Nature Reviews Genetics, 26(10): 659–660, 2025

2025
[16]

Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale

Zhou, J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nature Genetics, 54(5):725–734, 2022. 6

2022