arXiv: 2604.04155 · v1 · submitted 2026-04-05 · cs.LG · cs.IT · math.IT · q-bio.QM · stat.ML

Recognition: 3 theorem links

The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

Prashant C. Raju

Pith reviewed 2026-05-13 16:47 UTC · model grok-4.3

keywords: geometric alignment tax · discrete tokenization · continuous geometry · scientific foundation models · geometric distortion · rate-distortion theory · biological models · mutual information

The pith

Discrete tokenization in scientific foundation models imposes up to 8.5 times more geometric distortion than continuous alternatives on identical encoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that forcing continuous physical and biological manifolds through discrete categorical bottlenecks creates an intrinsic Geometric Alignment Tax that prevents faithful representation. Controlled ablations on synthetic dynamical systems show that swapping cross-entropy for a continuous head cuts distortion by up to 8.5x, while learned codebooks display a non-monotonic double bind in which finer quantization improves reconstruction yet harms geometry. Evaluations of fourteen biological models reveal three consistent failure regimes and confirm that no current architecture simultaneously achieves low distortion, high mutual information, and global coherence. These results matter because accurate preservation of continuous geometry is required for reliable modeling of dynamical systems in biology and physics.

Core claim

The root cause is the Geometric Alignment Tax, an intrinsic cost of discrete tokenization. On identical encoders, continuous objectives produce at most 1.3x architectural variation while discrete tokenization produces 3,000x variation. Learned codebooks worsen geometric fidelity with finer quantization despite better reconstruction. Real models fall into Local-Global Decoupling, Representational Compression, or Geometric Vacuity, and Evo 2's reverse-complement robustness reflects conserved composition rather than learned symmetry.
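
To make the distinction between "conserved composition" and "learned symmetry" concrete, the sketch below computes a sequence's reverse complement and compares k-mer counts between the two strands. It is only an illustration: the function names and the cosine-similarity score are assumptions made here, not the paper's protocol, which the Figure 3 caption describes only as preserving per-sequence k-mer counts under dinucleotide shuffling.

```python
from collections import Counter
from math import sqrt

# Watson-Crick pairing: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rc_composition_similarity(seq: str, k: int = 2) -> float:
    """Hypothetical score: similarity of the k-mer profiles of seq and its reverse complement."""
    return cosine(kmer_counts(seq, k), kmer_counts(reverse_complement(seq), k))


seq = "AACG"
print(reverse_complement(seq))            # CGTT
print(rc_composition_similarity(seq, 2))  # the strands share one 2-mer (CG) out of three each
```

A score near 1 means the two strands look alike to any model that mostly tracks k-mer statistics, so strand robustness on such sequences would not by itself demonstrate a learned A↔T / C↔G symmetry.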

What carries the argument

The Geometric Alignment Tax: the measurable cost of routing continuous manifolds through discrete categorical bottlenecks, quantified by rate-distortion curves and MINE mutual-information estimates.
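
For readers less familiar with the two diagnostics named here, their textbook forms are given below. These are the standard definitions (Shannon's rate-distortion function, and the Donsker-Varadhan lower bound that MINE maximizes over a neural critic $T_\theta$). How the paper instantiates the distortion measure $d$ and the variables $X$, $\hat{X}$, $Z$ is not stated in this summary, so the symbols are generic.

```latex
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\;\mathbb{E}[d(X,\hat{X})]\le D} I(X;\hat{X})

I(X;Z) \;=\; D_{\mathrm{KL}}\!\left(P_{XZ}\,\|\,P_X\otimes P_Z\right)
\;\ge\; \sup_{\theta}\; \mathbb{E}_{P_{XZ}}\!\left[T_\theta(x,z)\right]
- \log \mathbb{E}_{P_X\otimes P_Z}\!\left[e^{T_\theta(x,z)}\right]
```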

If this is right

Continuous objectives make architecture choice nearly irrelevant (1.3x spread) while discrete objectives amplify architectural differences by three orders of magnitude.
Learned codebooks create a non-monotonic trade-off: finer quantization improves reconstruction but increases geometric distortion.
Existing biological foundation models fall into one of three regimes: Local-Global Decoupling, Representational Compression, or Geometric Vacuity.
No model reaches the joint optimum of low distortion, high mutual information, and global coherence.
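
The reconstruction-versus-geometry trade-off in the list above involves two separate measurements, which the following sketch computes side by side on a toy Lorenz trajectory. Everything here is an assumption for illustration: the Lorenz parameters are the conventional ones, the "codebook" is a fixed uniform grid rather than a learned VQ encoder, and the function names are hypothetical. A fixed grid on raw coordinates is not expected to reproduce the non-monotone curve reported for learned codebooks; the block only shows how reconstruction MSE and a Procrustes disparity can be tracked together as the codebook gets finer.

```python
import numpy as np


def lorenz_trajectory(n_steps=2000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Forward-Euler integration of the Lorenz system; returns an (n_steps, 3) array."""
    xyz = np.empty((n_steps, 3))
    xyz[0] = (0.0, 1.0, 1.05)
    for t in range(n_steps - 1):
        x, y, z = xyz[t]
        deriv = np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
        xyz[t + 1] = xyz[t] + dt * deriv
    return xyz


def grid_quantize(points, k):
    """Snap each coordinate to the centre of one of k equal-width bins (a fixed codebook)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    width = (hi - lo) / k
    idx = np.clip(np.floor((points - lo) / width), 0, k - 1)
    return lo + (idx + 0.5) * width


def procrustes_disparity(a, b):
    """Residual after optimally translating, scaling and rotating/reflecting b onto a.

    Both point sets are centred and scaled to unit Frobenius norm, so the
    result lies in [0, 1], with 0 meaning identical shape.
    """
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    s = np.linalg.svd(a.T @ b, compute_uv=False).sum()
    return max(0.0, 1.0 - s ** 2)


traj = lorenz_trajectory()
for k in (4, 16, 64):
    quantized = grid_quantize(traj, k)
    mse = float(np.mean((traj - quantized) ** 2))
    print(k, mse, procrustes_disparity(traj, quantized))
```

To probe the double bind itself, `grid_quantize` would have to be replaced by the embeddings of a trained vector-quantized model, which this sketch does not attempt.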

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Scientific applications may require architectures that avoid discrete bottlenecks entirely rather than tuning tokenization granularity.
The observed divergence under discrete objectives suggests that downstream tasks relying on geometric relationships, such as molecular dynamics or trajectory prediction, will inherit systematic errors.
Hybrid representations that combine limited discrete tokens with continuous refinement layers could be tested to reduce the tax while retaining some discrete benefits.
The same rate-distortion and coherence diagnostics could be applied to foundation models in chemistry or climate science to check for analogous alignment failures.

Load-bearing premise

The synthetic dynamical systems and rate-distortion/MINE metrics used in the evaluations faithfully capture the geometric properties and alignment failures present in real biological and physical data.

What would settle it

A discrete-tokenized model that simultaneously achieves low geometric distortion, high mutual information, and global coherence on real DNA sequences or physical trajectories would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.04155 by Prashant C. Raju.

**Figure 1.** A. Track A vs. Track B Lipschitz profiles: smooth arcs (continuous physics) vs. divergent, multi-scale fracture (discrete biology). B. Continuous vs. discrete Procrustes D across architectures on the Lorenz dataset at 1% noise. All continuous conditions cluster near zero; discrete conditions span an order of magnitude. C. VQ double bind: reconstruction MSE (decreasing) vs. Procrustes D (non-monotone) vs. c…

**Figure 2.** A. ESM-2 composite stability (blue, left axis) vs. parameters, with Procrustes reduction overlaid (orange, right axis). Stability declines monotonically from 8M to 3B; the 15B "recovery" is unmasked by the simultaneous spike in Procrustes reduction, revealing global manifold drift rather than genuine geometric improvement. B. Conceptual illustration of the two failure modes. Ground Truth: the manifold is a…

**Figure 3.** A. Texture Hypothesis Test. RC RDM similarity across four conditions for Evo 2 (7B, 8K context, 10,000 sequences). Dinuc-shuffled real DNA (per-sequence k-mer counts preserved) recovers 97% of the real-random gap; texture-matched Markov (population-level statistics only) recovers 3%. B. The RC Dissociation explained. On synthetic DNA (left), discrete tokens destroy the A↔T / C↔G bijection entirely (RDM ∼ 0…

**Figure 4.** (A) Excess MI (bias-corrected) across the three failure regimes. ProtMamba falls below zero (Geometric Vacuity), ESM-1b and OpenFold show large positive values (Representational Compression), and Evo 2 is modest and positive (Local-Global Decoupling). Random baselines sit at zero by construction. (B) Regime I: Evo 2 global vs. local MI. The flat curve across 64× context expansion confirms informational sha…

**Figure 5.** Effect of embedding-level RCCR on DNABERT-2 (117M). (A) Training loss converges rapidly (99.4% reduction in 10 epochs). (B) Per-sequence RC cosine gap collapses from 0.041 to 0.000: perfect pointwise consistency. (C) Despite this, Procrustes disparity between forward and RC embedding matrices increases 91% (0.76 → 1.45): population-level geometric structure degrades. (D) Shesha composite stability by pertu…

Original abstract

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows, on synthetic systems, that tokenization creates measurable geometric distortion in scientific foundation models, and it flags three failure regimes in real models, but the causal jump from synthetic systems to biology needs more checks.

The main takeaway is that forcing continuous scientific data through discrete tokens carries a real cost. Controlled ablations on synthetic dynamical systems show that swapping cross-entropy for a continuous head on the same encoder cuts geometric distortion by up to 8.5x. Learned codebooks hit a non-monotonic bind where finer quantization improves reconstruction but worsens geometry. Under continuous training the three architectures stay within 1.3x of each other, but discrete tokenization stretches that gap to 3,000x. On 14 biological models they apply rate-distortion and MINE to identify Local-Global Decoupling, Representational Compression, and Geometric Vacuity, and a targeted check confirms Evo 2's reverse-complement behavior tracks sequence statistics rather than learned symmetry. No model hits low distortion, high mutual information, and global coherence together.

That framing and the quantitative splits are the clearest new pieces. The work does a clean job of isolating the tokenization variable and giving numbers to a problem people have noticed in practice.

The soft spot is the reliance on synthetic dynamical systems for the causal story. The headline result and the tax mechanism are demonstrated there, then used to explain the regimes seen in real models. If those synthetics do not preserve the curvature, long-range correlations, or global coherence of protein backbones or genomic sequences, the attribution weakens even though the metrics on the 14 models are applied directly.

The paper is aimed at people building or auditing foundation models for biology and physics. It deserves a serious referee because the experiments are controlled enough to generate useful discussion and the metrics are reproducible, even if reviewers will press on generalizability to real manifolds.

Referee Report

2 major / 2 minor

Summary. The paper claims that foundation models for biology and physics incur a 'Geometric Alignment Tax' from forcing continuous manifolds through discrete tokenization bottlenecks. Controlled ablations on synthetic dynamical systems show that a continuous head on an identical encoder reduces geometric distortion by up to 8.5x versus cross-entropy, while learned codebooks exhibit a non-monotonic double bind (finer quantization improves reconstruction but worsens geometry). Under continuous objectives architectures differ by only 1.3x, but under discrete tokenization they diverge by 3,000x. Rate-distortion and MINE analysis of 14 biological foundation models identifies three failure regimes (Local-Global Decoupling, Representational Compression, Geometric Vacuity); a controlled experiment shows Evo 2's reverse-complement robustness reflects sequence composition rather than learned symmetry. No model simultaneously achieves low distortion, high mutual information, and global coherence.

Significance. If the central claims hold, the work supplies a concrete, quantitative diagnosis of why current scientific foundation models systematically distort geometry and offers a clear architectural direction (continuous heads) that measurably mitigates the problem. The controlled synthetic ablations and the taxonomy of failure regimes provide falsifiable predictions that could guide future model design for physics and biology.

major comments (2)

§5 (Real-model evaluation): The attribution of the three failure regimes observed in the 14 biological foundation models to the Geometric Alignment Tax is load-bearing for the paper's central claim, yet rests on extrapolation from synthetic dynamical systems. The manuscript reports rate-distortion and MINE metrics on the real models but does not include any direct test (e.g., curvature histograms, correlation-length statistics, or topological invariants) showing that the chosen synthetic systems preserve the manifold geometry of real data such as protein backbones or genomic sequences. Without this, the causal link between tokenization and the observed regimes remains correlational rather than demonstrated.
§3.2 (Ablation results): The headline quantitative claim of an 8.5x reduction in geometric distortion when replacing cross-entropy with a continuous head is central to the argument. The geometric-distortion metric itself (presumably derived from the rate-distortion or MINE quantities introduced later) is not given an explicit equation or pseudocode in the ablation section, making it impossible to verify that the factor is independent of post-hoc metric choices or hyper-parameter tuning.

minor comments (2)

Abstract and §4: The factor '3,000x' divergence under discrete tokenization is striking but the baseline (which architecture pair, which exact distortion measure) is not restated, reducing readability.
Notation: The paper introduces 'Geometric Alignment Tax' as a named quantity but does not supply a compact mathematical expression for it; a short definition (e.g., Tax = D_geo(discrete) / D_geo(continuous)) would aid precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The two major points raised are addressable through targeted revisions that strengthen the manuscript without altering its core claims. We respond to each below.

Referee: §5 (Real-model evaluation): The attribution of the three failure regimes observed in the 14 biological foundation models to the Geometric Alignment Tax is load-bearing for the paper's central claim, yet rests on extrapolation from synthetic dynamical systems. The manuscript reports rate-distortion and MINE metrics on the real models but does not include any direct test (e.g., curvature histograms, correlation-length statistics, or topological invariants) showing that the chosen synthetic systems preserve the manifold geometry of real data such as protein backbones or genomic sequences. Without this, the causal link between tokenization and the observed regimes remains correlational rather than demonstrated.

Authors: We agree that a direct geometric comparison between the synthetic dynamical systems and real biological manifolds would make the causal attribution more robust. The synthetic systems (Lorenz, Rössler, and linear oscillators) were chosen because they exhibit the same continuous manifold properties (smooth trajectories, local Euclidean structure, and global coherence) that are distorted by tokenization in the real models. In the revised manuscript we will add a new subsection to §5 that computes correlation-length statistics and curvature histograms on both the synthetic trajectories and on representative subsets of the protein backbone and genomic sequence data used for the 14-model evaluation. This will quantify the degree of manifold similarity and thereby convert the current correlational evidence into a stronger, geometry-grounded link.
Revision: yes

Referee: §3.2 (Ablation results): The headline quantitative claim of an 8.5x reduction in geometric distortion when replacing cross-entropy with a continuous head is central to the argument. The geometric-distortion metric itself (presumably derived from the rate-distortion or MINE quantities introduced later) is not given an explicit equation or pseudocode in the ablation section, making it impossible to verify that the factor is independent of post-hoc metric choices or hyper-parameter tuning.

Authors: We acknowledge the omission. The geometric-distortion metric used throughout the paper, including in the §3.2 ablations, is the normalized average pairwise distance distortion: D_geo = (1/M) Σ_{i<j} |d_X(x_i,x_j) − d_Z(z_i,z_j)| / d_X(x_i,x_j), where d_X is Euclidean distance in input space and d_Z is Euclidean distance in the continuous latent space (or in the codebook embedding for discrete cases). We will insert this explicit definition together with the corresponding pseudocode immediately before the ablation results in the revised §3.2, ensuring the 8.5× factor can be independently recomputed from the released code and data.
Revision: yes
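
The D_geo formula quoted in this simulated response can be sketched in a few lines. This is a minimal reading of the formula, not the author's released code: the function name `d_geo` is hypothetical, M is assumed to be the number of pairs with i < j, and pairs of coincident input points are skipped to avoid division by zero (the response does not say how they are treated).

```python
from itertools import combinations
from math import dist


def d_geo(xs, zs):
    """Mean relative pairwise-distance distortion between inputs xs and latents zs.

    Assumption: M counts the pairs (i < j) with non-zero input distance;
    coincident input points are skipped to avoid dividing by zero.
    """
    if len(xs) != len(zs):
        raise ValueError("xs and zs must have the same number of points")
    total, m = 0.0, 0
    for i, j in combinations(range(len(xs)), 2):
        d_x = dist(xs[i], xs[j])
        if d_x == 0.0:
            continue
        total += abs(d_x - dist(zs[i], zs[j])) / d_x
        m += 1
    if m == 0:
        raise ValueError("need at least two distinct input points")
    return total / m


xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(d_geo(xs, xs))                               # 0.0 for an isometric embedding
print(d_geo(xs, [(2 * a, 2 * b) for a, b in xs]))  # close to 1.0: every distance doubled
```

The second call shows that the formula, as written, is sensitive to a global rescaling of the latent space, so latent distances would presumably need a consistent normalization before models are compared. The ratio the referee suggests, Tax = D_geo(discrete) / D_geo(continuous), would then be two calls to this function on the same inputs.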

Circularity Check

0 steps flagged

No circularity: empirical results from ablations and external evaluations

The paper presents its core findings as outcomes of controlled ablations on synthetic dynamical systems (showing up to 8.5x distortion reduction with continuous heads) and rate-distortion/MINE evaluations on 14 real biological models. These are framed as experimental demonstrations rather than mathematical derivations. No steps reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations whose content is unverified outside the paper. The claims rest on observable differences across architectures and objectives, with no evidence that the reported geometric tax or failure regimes are tautological with the input metrics or synthetic data generation process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper introduces the Geometric Alignment Tax as a new explanatory concept and relies on domain assumptions about what rate-distortion theory and MINE measure in model representations. No free parameters or invented entities with independent evidence are explicitly detailed in the abstract.

axioms (1)

domain assumption: Rate-distortion theory and MINE provide valid measures of geometric distortion and mutual information in the internal representations of foundation models.
These tools are used to evaluate the 14 biological models and identify failure regimes.

invented entities (1)

Geometric Alignment Tax (no independent evidence)
purpose: To name and explain the intrinsic cost of forcing continuous manifolds through discrete tokenization bottlenecks.
Introduced as the root cause identified through ablations; no independent evidence outside the paper's experiments is mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel + dAlembert_to_ODE_general matches

MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5× ... Under continuous objectives, three architectures differ by 1.3×; under discrete tokenization, they diverge by 3,000×
IndisputableMonolith/Foundation/ArithmeticFromLogic embed_injective + embed_strictMono_of_one_lt echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

finer quantization improves reconstruction but worsens geometric stability ... D_proc ∝ 1/log K
IndisputableMonolith/Foundation/AbsoluteFloorClosure reality_from_one_distinction matches

the tax is not a property of attention, recurrence, or convolution; it is the price of discretizing a continuous world before processing it

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
