pith · machine review for the scientific record

arxiv: 2604.16570 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI

Recognition: unknown

In Search of Lost DNA Sequence Pretraining

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords pretraining · datasets · downstream · sequence · evaluation · problems · vocabulary · achieved

The pith

DNA pretraining suffers from inappropriate evaluation datasets, flawed neighbor-masking, and neglected vocabulary design; the authors supply guidelines and a reproducible testbed to fix them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DNA is the instruction book for living things, and researchers train large AI models on long strings of A, C, G, T letters to predict what genes do or how proteins are made. Current training methods skip some basic checks: they test on datasets that may not match real use cases, use a masking trick that hides nearby letters in ways that create artificial patterns, and rarely explain how they split the DNA into tokens the model can read. The authors ran experiments to show these shortcuts hurt performance and then gave clear rules for choosing better test data, designing tasks, and picking vocabularies. They also released a standard set of benchmarks so different labs can compare models on equal terms.
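To make the masking problem concrete: with overlapping k-mer tokenization (the DNABERT-style scheme the paper revisits), a single masked token can be reconstructed exactly from its still-visible neighbors, which is the leakage that motivated neighbor masking in the first place and the strategy whose flaws the paper examines. The sketch below is illustrative only; the k-mer size, the toy sequence, and the helper names are assumptions, not the paper's code.

```python
# Minimal sketch (not the authors' code): overlapping k-mer tokenization and the
# information leakage that makes masking a single k-mer trivial, because its
# visible neighbors already contain every one of its characters.

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Overlapping k-mers with stride 1 (DNABERT-style tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def recover_masked(tokens: list[str], idx: int, k: int = 6) -> str:
    """Reconstruct tokens[idx] from its immediate neighbors alone."""
    # tokens[idx] shares its first k-1 characters with tokens[idx-1] (shifted by one),
    # and its final character is character k-2 of tokens[idx+1].
    return tokens[idx - 1][1:] + tokens[idx + 1][k - 2]

seq = "ACGTACGGTTCAGT"          # toy sequence, illustrative only
k = 6
tokens = kmer_tokenize(seq, k)
masked_idx = 4                   # pretend this token is [MASK]ed
assert recover_masked(tokens, masked_idx, k) == tokens[masked_idx]
print("masked token recovered exactly from its neighbors:", tokens[masked_idx])
```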

Core claim

We reveal three critical yet heretofore overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion on vocabulary. ... we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods.

Load-bearing premise

That the three identified problems are the primary overlooked issues and that the proposed guidelines plus testbed will produce meaningfully better genomic models, as validated by the authors' experiments.

Figures

Figures reproduced from arXiv: 2604.16570 by Jianqiang Huang, Jiaxin Qi, Jinli Ou, Yan Cui, Yuhua Zheng, Zhijiang Tang.

Figure 1: (a) Overview of DNA pretraining and its downstream ap…
Figure 2: Illustrations of our guiding tasks. Frozen tokens are…
Figure 3: Illustrations of our downstream dataset selection crite…
Figure 5: Visualization of token importance in the BPE vocabu… (a toy BPE sketch follows this figure list)
Figure 6: Supplementary for Figure 3 in the main paper. The scaling law adherence status of all 26 downstream datasets. We model the…
Figure 7: Supplementary for Figure 4 in the main paper. Visualization of pretraining loss curves for different guiding tasks and different…
Figure 8: Visualization of token distribution. All vocabulary sizes are 256, excluding special tokens. "Accuracy" refers to the token…
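Figures 5 and 8 concern BPE vocabularies and token distributions (vocabulary size 256 in the paper's runs). As a rough illustration of what "vocabulary design" involves, the toy sketch below learns a tiny byte-pair-style vocabulary over a random DNA string and reports the token-length distribution; the target size of 16, the random sequence, and the function names are illustrative assumptions, not the paper's setup.

```python
# Toy sketch (not the paper's code): a byte-pair-encoding-style vocabulary learned
# over a DNA string, starting from the four bases and greedily merging the most
# frequent adjacent pair until the target vocabulary size is reached.
import random
from collections import Counter

def train_bpe(seq: str, vocab_size: int) -> list[str]:
    vocab = sorted(set(seq))                 # start from {A, C, G, T}
    tokens = list(seq)                       # current segmentation of the corpus
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged, i, new_tokens = a + b, 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                new_tokens.append(merged)    # apply the merge everywhere it occurs
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens, vocab = new_tokens, vocab + [merged]
    return vocab

random.seed(0)
dna = "".join(random.choice("ACGT") for _ in range(2000))
vocab = train_bpe(dna, vocab_size=16)
print("learned tokens:", vocab)
print("token-length distribution:", dict(Counter(len(tok) for tok in vocab)))
```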
Original abstract

DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet heretofore overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion on vocabulary. Therefore, we undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guiding task design, and in-depth vocabulary analysis. Extensive experiments validate the significance of our identified problems and support the rationale behind our recommendations. Finally, we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods to advance the development of genomic foundation models.
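One dataset-selection criterion gestured at by Figure 6 is scaling-law adherence: downstream scores on a useful evaluation dataset should improve predictably with pretraining scale. The sketch below fits a saturating power law to hypothetical (scale, score) pairs and uses goodness of fit as a crude adherence check; the functional form, the synthetic numbers, and the R² cutoff are assumptions for illustration, not the paper's exact criterion.

```python
# Hedged sketch of a scaling-law adherence check. All numbers are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(n, a, b, c):
    """Downstream score as a saturating function of pretraining tokens n."""
    return a - b * np.power(n, -c)

# Hypothetical (pretraining tokens, downstream accuracy) measurements.
n_tokens = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
accuracy = np.array([0.61, 0.66, 0.71, 0.74, 0.76])

params, _ = curve_fit(scaling_curve, n_tokens, accuracy,
                      p0=[0.8, 1.0, 0.1], maxfev=10000)
pred = scaling_curve(n_tokens, *params)
ss_res = np.sum((accuracy - pred) ** 2)
ss_tot = np.sum((accuracy - np.mean(accuracy)) ** 2)
r2 = 1.0 - ss_res / ss_tot

# A dataset whose scores do not improve predictably with scale is a poor probe
# of pretraining quality; 0.9 is an arbitrary illustrative cutoff.
print(f"R^2 = {r2:.3f}; adheres to scaling law: {r2 > 0.9}")
```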

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical methods critique with no mathematical derivations, no fitted constants, and no new postulated entities; all content is drawn from standard ML practice in genomics.

pith-pipeline@v0.9.0 · 5450 in / 1056 out tokens · 52916 ms · 2026-05-10T08:49:32.171812+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 10 canonical work pages · 3 internal anchors

1. Robin Andersson, Claudia Gebhard, Irene Miguel-Escalada, Ilka Hoof, Jette Bornholdt, Mette Boyd, Yun Chen, Xiaobei Zhao, Christian Schmidl, Takahiro Suzuki, et al. An atlas of active enhancers across human cell types and tissues. Nature, 507(7493):455–461, 2014.
2. Pablo Millan Arias, Niousha Sadjadi, Monireh Safari, ZeMing Gong, Austin T. Wang, Scott C. Lowe, Joakim Bruslund Haurum, Iuliia Zarubiieva, Dirk Steinke, Lila Kari, et al. BarcodeBERT: Transformers for biodiversity analysis. arXiv preprint arXiv:2311.02401, 2023.
3. Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv, 2025.
4. ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57, 2012.
5. Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P. de Almeida, Hassan Sirelkhatim, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods, 22(2):287–297, 2025.
6. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
7. Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
8. Oriol Fornes, Jaime A. Castro-Mondragon, Aziz Khan, Robin Van der Lee, Xi Zhang, Phillip A. Richmond, Bhavi P. Modi, Solenne Correard, Marius Gheorghe, Damir Baranašić, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Research, 48(D1):D87–D92, 2020.
9. Adam Frankish, Mark Diekhans, Irwin Jungreis, Julien Lagarde, Jane E. Loveland, Jonathan M. Mudge, Cristina Sisu, James C. Wright, Joel Armstrong, If Barnes, et al. GENCODE 2021. Nucleic Acids Research, 49(D1):D916–D923, 2021.
10. Margaret Gardiner-Garden and Marianne Frommer. CpG islands in vertebrate genomes. Journal of Molecular Biology, 196(2):261–282, 1987.
11. Genome Reference Consortium. Human genome assembly GRCh38.p14 (GCF 000001405.40). https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40/. Accessed: 2025-07-29.
13. Jan Gorodkin. Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry, 28(5-6):367–374, 2004.
14. Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček, and Panagiotis Alexiou. Genomic Benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data, 24(1):25, 2023.
15. Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
16. Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, 1998.
17. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
18. Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V. Davuluri. DNABERT: pre-trained bidirectional encoder representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
19. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
20. Tony Kouzarides. Chromatin modifications and their function. Cell, 128(4):693–705, 2007.
21. Jerome Ku, Eric Nguyen, David W. Romero, Garyk Brixi, Brandon Yang, Anton Vorontsov, Ali Taghibakhshi, Amy X. Lu, Dave P. Burke, Greg Brockman, et al. Systems and algorithms for convolutional multi-hybrid language models at scale. arXiv preprint arXiv:2503.01868, 2025.
22. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
23. Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
24. Frederikke Isa Marin, Felix Teufel, Marc Horlacher, Dennis Madsen, Dennis Pultz, Ole Winther, and Wouter Boomsma. BEND: Benchmarking DNA language models on biologically meaningful tasks. arXiv preprint arXiv:2311.12570, 2023.
25. Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems, 36:43177–43201, 2023.
26. Eric Nguyen, Michael Poli, Matthew G. Durrant, Brian Kang, Dhruva Katrekar, David B. Li, Liam J. Bartie, Armin W. Thomas, Samuel H. King, Garyk Brixi, et al. Sequence modeling and design from molecular to genome scale with Evo. Science, 386(6723):eado9336, 2024.
27. Shinya Oki, Tazro Ohta, Go Shioi, Hideki Hatanaka, Osamu Ogasawara, Yoshihiro Okuda, Hideya Kawaji, Ryo Nakaki, Jun Sese, and Chikara Meno. ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO Reports, 19(12):e46255, 2018.
28. Lifeng Qiao, Peng Ye, Yuchen Ren, Weiqiang Bai, Chaoqi Liang, Xinzhu Ma, Nanqing Dong, and Wanli Ouyang. Model decides how to tokenize: Adaptive DNA sequence tokenization with MxDNA. Advances in Neural Information Processing Systems, 37:66080–66107, 2024.
29. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
30. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
31. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
32. Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
33. S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality. Biometrika, 52(3):591–611, 1965.
34. Aaron Sievers, Katharina Bosiek, Marc Bisch, Chris Dreessen, Jascha Riedel, Patrick Froß, Michael Hausmann, and Georg Hildenbrand. K-mer content, correlation, and position analysis of genome DNA sequences for the identification of function and evolutionary features. Genes, 8(4):122, 2017.
35. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
36. Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
37. Wei Wu, Qiuyi Li, Mingyang Li, Kun Fu, Fuli Feng, Jieping Ye, Hui Xiong, and Zheng Wang. Generator: a long-context generative genomic foundation model. arXiv preprint arXiv:2502.07272, 2025.
38. Daoan Zhang, Weitong Zhang, Yu Zhao, Jianguo Zhang, Bing He, Chenchen Qin, and Jianhua Yao. DNAGPT: A generalized pre-trained tool for versatile DNA sequence analysis tasks. arXiv preprint arXiv:2307.05628, 2023.
39. Jian Zhou and Olga G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10):931–934, 2015.
40. Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.