pith. machine review for the scientific record.

arxiv: 2604.06549 · v1 · submitted 2026-04-08 · 🧬 q-bio.GN

Recognition: no theorem link

The Mechanistic Invariance Test: Genomic Language Models Fail to Learn Positional Regulatory Logic


Pith reviewed 2026-05-10 18:39 UTC · model grok-4.3

classification 🧬 q-bio.GN
keywords genomic language models · positional regulatory logic · mechanistic invariance test · AT content correlation · compositional bias · regulatory elements · gene regulation

The pith

Genomic language models fail to learn positional regulatory logic and instead exploit AT content correlations in DNA sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Mechanistic Invariance Test, a benchmark of 650 sequences with scrambled controls, to check whether genomic language models capture how the position of regulatory elements matters for gene control. Systematic tests show that models' sensitivity to sequence changes tracks only the overall AT base content, not the actual locations of elements. This leads to models assigning higher scores to wrong positions than correct ones and ignoring strand orientation entirely. A basic 100-parameter position-aware model outperforms all the large language models, indicating the issue is not lack of scale but the models' training biases.

Core claim

Through the Mechanistic Invariance Test and follow-up probes including AT titration, positional ablation, spacing changes, and strand flips, all tested genomic language models exhibit performance driven solely by correlation with AT content (r=0.78-0.96) rather than any understanding of positional grammar. Models invert biological reality by scoring incorrect positions higher than correct ones, remain strand-blind, and show compositional effects dominating positional ones by a factor of 46. Larger models amplify the bias, while a simple position-aware PWM reaches perfect scores on the benchmark.

What carries the argument

The Mechanistic Invariance Test (MIT), a 650-sequence benchmark across 8 classes with scrambled controls that separates compositional sensitivity from positional regulatory understanding.
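A minimal sketch of the kind of scrambled control such a benchmark relies on (the paper's exact scrambling procedure is not reproduced here): a uniform shuffle preserves base composition, and hence AT content, exactly, while destroying all positional structure.

```python
import random

def scrambled_control(seq: str, seed: int = 0) -> str:
    """Uniformly shuffle a sequence: base composition (and AT content)
    is preserved exactly, while positional information is destroyed."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

# Toy sigma70-style layout: -35 box, spacer, -10 box (illustrative only).
promoter = "TTGACA" + "ACGT" * 4 + "A" + "TATAAT"
control = scrambled_control(promoter)
```

A purely compositional model scores `promoter` and `control` identically; a positional model should not.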

Load-bearing premise

That the Mechanistic Invariance Test with its scrambled controls cleanly isolates positional regulatory logic from compositional effects without introducing other uncontrolled biases in sequence generation or scoring.

What would settle it

A direct measurement showing that genomic language models assign higher regulatory scores to sequences with correct element positions than to matched sequences with incorrect positions when AT content is held constant across all positions.
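One way to sketch such a measurement (hypothetical helper, not the paper's protocol): embed the same motif at a correct and an incorrect offset in an otherwise identical background, so base composition is held constant and only position differs, then compare model scores.

```python
def place_motif(motif: str, pos: int, length: int = 60, background: str = "G") -> str:
    """Embed a motif at a fixed offset in a constant background, so every
    placement has identical base composition; only position changes."""
    assert 0 <= pos <= length - len(motif)
    return background * pos + motif + background * (length - pos - len(motif))

correct = place_motif("TATAAT", 30)  # "correct" offset (toy choice)
wrong = place_motif("TATAAT", 5)     # same motif, wrong offset
# A position-aware model should score `correct` above `wrong`;
# the paper reports that gLMs often do the opposite.
```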

Figures

Figures reproduced from arXiv: 2604.06549 by Bryan Cheng, Jasper Zhang.

Figure 1
Figure 1: MIT benchmark overview. (a) Promoter architecture showing -35 box, -10 box, UP element, and extended -10 positions. (b) CSS across models; asterisk indicates pFDR < 0.05. Abbreviations: PA=PA-PWM, RPA=RPA-PWM, Thm=Thermodynamic. (c) SCR measuring positional awareness; all gLMs near chance (0.5). (d) CSS vs. SCR: biophysical models (orange) achieve both high CSS and SCR; gLMs (blue) show only compositional…
Figure 2
Figure 2: Mechanistic probing. (a) AT titration: LL increases with AT% (r = 0.78). (b) Positional sweep: removing UP (Δ ≈ 3.7) matters more than misplacing it (Δ ≈ 0.5). (c) Spacing: HyenaDNA peaks at 14 bp; PA-PWM at 17 bp. (d) Strand: forward scores lower than RC.
Figure 3
Figure 3: Effect magnitudes and model comparison. (a) Effect sizes on log scale showing AT content dominates. (b) MES comparison between natural and synthetic sequences. (c) CSS vs. SCR scatter with model type coloring.
Figure 4
Figure 4: Comprehensive metrics summary. (a) Metrics heatmap across all five gLMs and biophysical baselines (PA-PWM CSS=1.00, SCR=0.98). (b) CSS grouped by architecture type. (c) Compensation benefit by AT content. (d) CSS vs. model parameters showing PA-PWM (∼100 params) achieves highest CSS.
original abstract

Genomic language models (gLMs) have transformed computational biology, achieving state-of-the-art performance across genomic tasks. Yet a fundamental question threatens the foundation of this success: do these models learn the mechanistic principles governing gene regulation, or do they merely exploit statistical shortcuts? We introduce the Mechanistic Invariance Test (MIT), a rigorous 650-sequence benchmark across 8 classes with scrambled controls that enables clean discrimination between compositional sensitivity and genuine positional understanding. We evaluate five gLMs spanning all major architectural paradigms (autoregressive, masked, and bidirectional state-space models) and uncover a universal failure mode. Through systematic mechanistic probing via AT titration, positional ablation, spacing perturbation, and strand orientation tests, we demonstrate that apparent compensation sensitivity is driven entirely by AT content correlation (r=0.78-0.96 across architectures), not positional regulatory logic. The failures are striking: Evo2-1B and Caduceus score regulatory elements at incorrect positions higher than correct positions, inverting biological reality. All models are strand-blind. Compositional effects dominate positional effects by 46-fold. Perhaps most revealing, a simple 100-parameter position-aware PWM achieves perfect performance (CSS=1.00, SCR=0.98), exposing that billion-parameter gLMs fail not from insufficient capacity but from fundamentally misaligned inductive biases. Larger models show stronger compositional bias, demonstrating that scale amplifies rather than corrects this limitation. These findings reveal that current gLMs capture surface statistics while missing the positional grammar essential for gene regulation, demanding architectural innovation before deployment in synthetic biology, gene therapy, and clinical variant interpretation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Mechanistic Invariance Test (MIT), a 650-sequence benchmark across 8 classes with scrambled controls, to distinguish compositional sensitivity from positional regulatory logic in genomic language models. Evaluating five gLMs (autoregressive, masked, and state-space architectures) via AT titration, positional ablation, spacing perturbation, and strand orientation tests, it reports that all apparent positional effects are explained by AT-content correlation (r=0.78-0.96), with Evo2-1B and Caduceus inverting correct/incorrect position scores, all models being strand-blind, and compositional effects dominating positional ones by 46-fold. A 100-parameter position-aware PWM baseline achieves near-perfect scores (CSS=1.00, SCR=0.98), while larger models exhibit stronger compositional bias.

Significance. If the MIT controls are shown to isolate positional logic without residual compositional confounds, the work provides a valuable empirical demonstration that current gLMs capture surface statistics rather than the positional grammar of gene regulation. The systematic mechanistic probes and direct comparison to a lightweight PWM baseline are strengths, highlighting that the limitation is inductive bias rather than capacity. The finding that scale amplifies the bias has implications for deploying gLMs in synthetic biology and variant interpretation, and the benchmark itself offers a reusable test for future architectural improvements.

major comments (2)
  1. [MIT benchmark description] Section describing the MIT benchmark and scrambling procedure: the central claim that scrambled controls enable 'clean discrimination' between compositional and positional effects depends on explicit verification that scrambled sequences match originals on all statistics the models exploit (dinucleotide frequencies, motif co-occurrence, local GC gradients). Without reported Kolmogorov-Smirnov tests or matching statistics on these features, the AT-titration and ablation results could be driven by residual non-positional cues rather than pure composition.
  2. [Results on model inversions and dominance] Results section on positional ablation and inversion findings: the reported 46-fold dominance of compositional over positional effects and the inversion of correct vs. incorrect positions in Evo2-1B and Caduceus require the exact definition of the dominance ratio (e.g., which effect-size metric from the spacing-perturbation test) and statistical significance testing with correction for the 650-sequence multiple comparisons to support the 'universal failure mode' conclusion.
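The matching check asked for in major comment 1 can be sketched with a simple per-dinucleotide frequency comparison (a Kolmogorov-Smirnov test on these distributions would be the fuller version; the helper names below are hypothetical):

```python
from collections import Counter

def dinuc_freqs(seq: str) -> dict[str, float]:
    """Normalized dinucleotide frequencies of a sequence."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

def max_freq_gap(a: str, b: str) -> float:
    """Largest per-dinucleotide frequency difference between two sequences;
    a cheap stand-in for the distribution-matching test the referee requests."""
    fa, fb = dinuc_freqs(a), dinuc_freqs(b)
    return max(abs(fa.get(d, 0.0) - fb.get(d, 0.0)) for d in set(fa) | set(fb))
```

A scrambled control that matches its original in mononucleotide but not dinucleotide content would show a nonzero gap here, flagging a residual non-positional cue.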
minor comments (2)
  1. [Abstract and PWM baseline] Abstract and methods: the 100-parameter PWM is described as 'position-aware' but its exact construction (e.g., how positions are encoded relative to the 8 classes) should be detailed with pseudocode or a small table to allow direct replication.
  2. [Figures] Figure captions for the AT-titration and strand-orientation plots: axis labels and error bars should explicitly state whether they represent mean ± SEM across the 650 sequences or per-class aggregates.
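What a ~100-parameter position-aware PWM could look like can be sketched as follows (hypothetical construction; the paper's exact encoding is precisely what minor comment 1 asks to see). Two consensus-derived log-odds tables scored at hard-coded offsets are what make the model position-aware.

```python
# Toy log-odds PWMs for the -35 (TTGACA) and -10 (TATAAT) boxes.
# Two 6-column x 4-base tables are 48 parameters; with spacing terms a
# fuller version lands near the ~100 parameters cited for the PA-PWM.
def consensus_pwm(consensus: str, match: float = 1.0, mismatch: float = -1.0):
    """One {base: log-odds} column per consensus position."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"} for c in consensus]

PWM_35, PWM_10 = consensus_pwm("TTGACA"), consensus_pwm("TATAAT")

def pa_pwm_score(seq: str, off35: int = 0, off10: int = 23) -> float:
    """Score motifs only at fixed offsets (here a 17 bp spacer between
    boxes); scoring anywhere else is impossible by construction, unlike
    a scan-anywhere model."""
    def score_at(pwm, off):
        return sum(col[base] for col, base in zip(pwm, seq[off:off + len(pwm)]))
    return score_at(PWM_35, off35) + score_at(PWM_10, off10)
```

A correctly arranged toy promoter maximizes this score; any misplacement of either box can only lower it.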

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will implement.

point-by-point responses
  1. Referee: Section describing the MIT benchmark and scrambling procedure: the central claim that scrambled controls enable 'clean discrimination' between compositional and positional effects depends on explicit verification that scrambled sequences match originals on all statistics the models exploit (dinucleotide frequencies, motif co-occurrence, local GC gradients). Without reported Kolmogorov-Smirnov tests or matching statistics on these features, the AT-titration and ablation results could be driven by residual non-positional cues rather than pure composition.

    Authors: We agree that explicit verification is necessary to fully substantiate the claim of clean discrimination. In the revised manuscript, we will add a supplementary section reporting Kolmogorov-Smirnov tests and matching statistics for dinucleotide frequencies, motif co-occurrence, and local GC gradients between original and scrambled sequences. These will confirm no significant residual differences, thereby strengthening the isolation of positional effects. revision: yes

  2. Referee: Results section on positional ablation and inversion findings: the reported 46-fold dominance of compositional over positional effects and the inversion of correct vs. incorrect positions in Evo2-1B and Caduceus require the exact definition of the dominance ratio (e.g., which effect-size metric from the spacing-perturbation test) and statistical significance testing with correction for the 650-sequence multiple comparisons to support the 'universal failure mode' conclusion.

    Authors: We will revise the results section to explicitly define the dominance ratio as the ratio of the compositional effect size (measured via the spacing-perturbation test) to the positional effect size (from positional ablation). We will also add statistical significance testing using Wilcoxon signed-rank tests with Bonferroni correction for multiple comparisons across the 650 sequences. These changes will provide rigorous quantitative support for the reported inversions and the 46-fold dominance. revision: yes
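The proposed correction scheme can be sketched with stdlib tools; the exact Wilcoxon signed-rank statistic needs a stats library, so a sign test stands in for it here.

```python
from math import comb

def sign_test_p(diffs: list[float]) -> float:
    """Two-sided sign-test p-value for paired differences (zeros dropped);
    a stdlib stand-in for the Wilcoxon signed-rank test named above."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    k = min(sum(d > 0 for d in nonzero), sum(d < 0 for d in nonzero))
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni(pvals: list[float]) -> list[float]:
    """Bonferroni correction: multiply each p-value by the number of
    comparisons (650 sequences in the MIT setting) and cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]
```

With 650 comparisons, Bonferroni is conservative; a false-discovery-rate procedure would be a reasonable alternative, but the sketch shows the mechanics.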

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with external controls

full rationale

The paper is an empirical benchmarking study that introduces the Mechanistic Invariance Test and applies it to evaluate gLMs on sequence perturbations. All reported results (AT correlations, positional scoring inversions, 46-fold dominance, PWM baseline performance) are direct experimental measurements on held-out or generated sequences rather than quantities derived from fitted parameters inside the same equations. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the central claims; the PWM baseline is presented as an independent comparator, not as a fitted input renamed as a prediction. The central claims are therefore checked against external sequence data and model outputs rather than against the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the benchmark design and AT-content correlation interpretation correctly attribute model behavior to missing positional logic rather than other factors.

axioms (1)
  • domain assumption Scrambled controls preserve compositional statistics while fully removing positional information
    Invoked to interpret model failures as evidence of absent positional understanding.

pith-pipeline@v0.9.0 · 5593 in / 1253 out tokens · 48745 ms · 2026-05-10T18:39:22.495822+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

8 extracted references

  1. [1]

Positions 25 and 45 show anomalously large penalties (−2.53 and −2.17) because the UP element overlaps with the -35 box (positions 30–35) and -10 box (positions 53–58), disrupting their consensus sequences

  2. [2]

Excluding these confounded positions, the positional effect ranges from −0.49 to +0.49, a total span of only 0.98 LL units

  3. [3]

The compositional effect (None vs. 15: −3.70) is ∼8× larger than the maximum positional effect (0.46)

    The compositional effect (None vs. 15: −3.70) is ∼8× larger than the maximum positional effect (0.46). B.4 Full spacing sensitivity results. Table 11: Complete spacing sensitivity experiment (n = 50 per spacing).

    Spacing (bp) | Mean LL | Std Dev | Δ vs. 17 bp
    12 | −143.47 | 4.10 | −1.20
    13 | −142.71 | 4.56 | −0.44
    14 (HyenaDNA peak) | −141.79 | 4.87 | +0.48
    15 | −142.87 | 4.91 | −0.60
    16 | −142.66 | 4.61 | −0.40
    …

  4. [4]

    HyenaDNA peaks at 14 bp, not the biologically optimal 17 bp

  5. [5]

The total range across all spacings is only 1.68 LL units (−143.47 to −141.79)

  6. [6]

For comparison, the AT content effect spans 21.0 LL units, 12.5× larger

  7. [7]

    PA-PWM succeeds by construction

The model shows no preference for the biologically correct 17 ± 1 bp range. (Workshop @ ICLR 2026.) B.5 Full strand orientation results. Table 12: Complete strand orientation experiment (n = 50 per condition) for HyenaDNA.

    Condition | Mean LL | Std Dev | Δ vs. Forward
    Forward (correct) | −143.79 | 4.45 | 0.00
    RC motifs in place | −142.83 | 3.99 | +0.96
    Full reverse complement | −142.13 | 3…

  8. [8]

    Our benchmark leverages this biological knowledge to create rigorous tests

are well-characterized biochemically. Our benchmark leverages this biological knowledge to create rigorous tests. M Example sequences: We provide representative sequences from each class to illustrate the benchmark design. Note: positions 0–57 shown; full sequences are 100 bp with random background extending to position 99. M.1 Class C: Synthetic intact. Pos: ...