DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

Julien Mozziconacci; Mickaël Delcey; Romain Karpinsky

arxiv: 2606.30140 · v1 · pith:F53OZK3Znew · submitted 2026-06-29 · 🧬 q-bio.GN · cs.CL


Pith reviewed 2026-06-30 03:28 UTC · model grok-4.3

keywords: DNA language models · pre-training · fine-tuning · BPE tokenization · transformer models · convolutional models · genomics

The pith

Pretraining transformer DNA models may not deliver gains worth their cost on fine-tuning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the performance improvements from heavily pre-trained transformer models on genomics tasks are large enough to offset their training costs compared to simpler convolutional models. It also quantifies how much pretraining itself adds and whether BPE tokenization is beneficial for DNA data. A reader would care because foundation models are expensive to train and deploy, so knowing when they are truly superior allows better allocation of resources in genomic research. The assessment directly tests three questions on model type, pretraining contribution, and tokenization effects through targeted benchmarks.

Core claim

Heavily pretrained transformer-based models do not always improve fine-tuning performance enough to justify the pretraining overhead. The actual contribution of pretraining and the impact of BPE tokenization on genomics-related tasks can be isolated and measured by comparing against convolutional baselines such as ConvNova.

What carries the argument

Ablation studies and systematic benchmarks that isolate pretraining effects and BPE tokenization when comparing transformer architectures to convolutional models on DNA fine-tuning tasks.
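One way to picture such an ablation is as a small factorial grid in which each comparison changes exactly one factor. The sketch below is illustrative only: the factor names and levels (`ARCHITECTURES`, `PRETRAINING`, `TOKENIZERS`) and both helper functions are assumptions made for this example, not the authors' experimental setup.

```python
from itertools import product

# Hypothetical factor levels, chosen for illustration only.
ARCHITECTURES = ("transformer", "convolutional")
PRETRAINING = (True, False)
TOKENIZERS = ("bpe", "single_nucleotide")


def ablation_grid():
    """Return every (architecture, pretrained, tokenizer) combination."""
    return [
        {"architecture": a, "pretrained": p, "tokenizer": t}
        for a, p, t in product(ARCHITECTURES, PRETRAINING, TOKENIZERS)
    ]


def isolating_pairs(grid, factor):
    """Pair up configurations that differ only in `factor`."""
    pairs = []
    for i, left in enumerate(grid):
        for right in grid[i + 1:]:
            differing = [key for key in left if left[key] != right[key]]
            if differing == [factor]:
                pairs.append((left, right))
    return pairs


grid = ablation_grid()
pretraining_pairs = isolating_pairs(grid, "pretrained")
tokenizer_pairs = isolating_pairs(grid, "tokenizer")
```

Each pair in `pretraining_pairs` holds architecture and tokenizer fixed, so any score difference within a pair can be attributed to pretraining alone. The same holds for `tokenizer_pairs` and the tokenization choice.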

If this is right

Simpler convolutional models may suffice for many genomics fine-tuning tasks where pretraining gains are modest.
Resources currently spent on large-scale pretraining could be redirected toward task-specific optimization or alternative architectures.
BPE tokenization choices should be validated per domain rather than assumed optimal for DNA sequences.
Model development pipelines for genomics would benefit from including explicit pretraining contribution metrics in evaluations.
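A pretraining contribution metric of the kind mentioned in the last point could be as simple as a per-task score difference between a pretrained model and the same architecture trained from scratch. The following is a minimal sketch under that assumption. The function names, task names, and scores are placeholders invented for this example and are not results from the paper.

```python
def pretraining_contribution(pretrained_scores, scratch_scores):
    """Per-task score difference: pretrained minus trained-from-scratch.

    Only tasks present in both dictionaries are compared.
    """
    shared = sorted(set(pretrained_scores) & set(scratch_scores))
    return {task: pretrained_scores[task] - scratch_scores[task] for task in shared}


def mean_contribution(deltas):
    """Average the per-task differences; returns 0.0 for an empty input."""
    return sum(deltas.values()) / len(deltas) if deltas else 0.0


# Placeholder numbers for illustration only; NOT results from the paper.
pretrained = {"task_a": 0.80, "task_b": 0.70}
scratch = {"task_a": 0.75, "task_b": 0.72}

deltas = pretraining_contribution(pretrained, scratch)
average = mean_contribution(deltas)
```

Reporting the per-task values alongside the mean matters because a positive average can hide tasks where pretraining makes no difference or hurts.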

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same assessment approach could be extended to protein or RNA sequence tasks to test whether pretraining overhead patterns generalize.
If BPE underperforms, domain-specific alternatives such as k-mer based tokenization could be developed and tested as direct replacements.
Wider adoption of cost-benefit benchmarks might slow the default transfer of LLM scaling practices into biology without domain validation.
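To make the contrast between k-mer and BPE tokenization concrete, the sketch below implements both on plain DNA strings. It is a toy illustration: the BPE learner is a simplified version of the merge procedure, and none of the function names or parameters are taken from DNABERT2 or any other real tokenizer.

```python
from collections import Counter


def kmer_tokenize(sequence, k=3, stride=1):
    """Split a DNA string into k-mers; stride < k gives overlapping tokens."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def bpe_tokenize(sequence, merges):
    """Tokenize by applying the learned merges in the order they were learned."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


kmers = kmer_tokenize("ACGTAC", k=3)
merges = learn_bpe_merges(["ATATAT"], num_merges=1)
bpe_tokens = bpe_tokenize("ATATGC", merges)
```

The k-mer tokens have a fixed length regardless of the data, whereas the BPE vocabulary depends on pair frequencies in the training corpus, so the same sequence can be segmented differently depending on what the tokenizer was trained on.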

Load-bearing premise

That systematic benchmark comparisons across transformer and convolutional DNA models remain scarce and that the relevance of BPE tokenization for DNA sequence representation is still debated within the genomics community.

What would settle it

A benchmark study showing that pre-trained transformers with BPE tokenization consistently and substantially outperform both non-pretrained versions and convolutional models across a wide range of genomics fine-tuning tasks would falsify the premise that their overhead requires special justification.

Figures

Figures reproduced from arXiv: 2606.30140 by Julien Mozziconacci, Mickaël Delcey, Romain Karpinsky.

**Figure 1.** Number of sequences (Nseq) for each dataset in the GUE benchmark. The datasets span a wide range of sizes, from a few thousand sequences in certain promoter and mouse tasks to more than 70,000 sequences in the virus classification dataset.

**Figure 2.** Sequence length (L) across datasets in the GUE benchmark. Most tasks involve short sequences (70–100 bases), while others such as EMP tasks use longer sequences (∼500 bases), and the virus classification dataset contains sequences up to 1000 bases.

Effect of Byte Pair Encoding: To assess the impact of Byte Pair Encoding (BPE) on lightweight architectures, we evaluated U-Net and ConvNova across all GUE ben…

Original abstract

Recent breakthroughs in foundation models and Large Language Models (LLMs) have introduced new opportunities for studying and decoding genomic sequences. Several state-of-the-art approaches, such as DNABERT2, rely on transformer-based architectures, while others, such as ConvNova, still build upon more conventional convolutional models. However, systematic benchmark comparisons across these methods remain scarce. Given that transformer-based models require extensive and costly pretraining, it is crucial to evaluate whether their performance gains justify this overhead. Moreover, LLMs such as DNABERT2 typically rely on Byte Pair Encoding (BPE) tokenization, whose relevance for DNA sequence representation is still debated within the genomics community. In this work, we investigate three key questions: (i) do transformer-based models provide sufficient improvements on fine-tuning tasks upon heavy pretraining, (ii) what is the actual contribution of pretraining in this setting, and (iii) how does BPE tokenization impact performance on genomics-related tasks?

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper asks three practical questions about whether pretraining transformers and BPE tokenization are worth the cost for DNA fine-tuning tasks, but the abstract shows no results or methods so the answers remain unknown.

The main takeaway is that this work sets up a controlled comparison to test if transformer DNA models like DNABERT2 deliver enough gains on fine-tuning to justify their pretraining expense versus convolutional models like ConvNova, while also measuring the isolated effects of pretraining and BPE tokenization.

It does a solid job stating the motivation plainly: pretraining is costly, systematic head-to-head benchmarks are rare, and BPE's suitability for DNA is still argued about in the community. Framing the assessment around three explicit questions keeps the scope tight and relevant to resource decisions in genomics AI.

What is new is the benchmark exercise itself. The abstract correctly notes the scarcity of such comparisons, so executing them could provide usable numbers on performance deltas and tokenization impact, assuming the experiments are run cleanly.

The soft spots are clear from the available text. No methods, datasets, results, or error analysis appear in the abstract, which makes it impossible to judge whether the fine-tuning tasks are representative, whether model sizes and data are matched fairly, or whether any gains are statistically reliable. Without those details the assessment stays at the level of a plan rather than a completed evaluation. The empirical nature keeps circularity risk low, but the lack of evidence is the central limitation.

This is for researchers choosing architectures for genomics models who need evidence on compute tradeoffs. A reader focused on practical deployment would get value from the outcomes if they hold up under scrutiny. It deserves a serious referee because the questions address a real allocation issue in the field, even though the current version would need the full experimental section to stand on its own.

Referee Report

2 major / 0 minor

Summary. The manuscript motivates and poses three research questions for an empirical assessment of DNA language models: whether transformer-based models (e.g., DNABERT2) deliver sufficient gains on fine-tuning tasks to justify costly pretraining relative to convolutional baselines (e.g., ConvNova); what the isolated contribution of pretraining is; and how BPE tokenization affects performance on genomics tasks. It notes that systematic comparisons remain scarce and that BPE relevance for DNA is debated.

Significance. A controlled, reproducible benchmark answering these questions would help the genomics community weigh the practical value of transformer pretraining against simpler convolutional alternatives and clarify tokenization choices, potentially guiding resource allocation in foundation-model development for sequences.

major comments (2)

[Abstract] Abstract: the manuscript states it will 'investigate three key questions' but provides no methods, datasets, fine-tuning tasks, baselines, metrics, or results. Without these elements it is impossible to determine whether any performance claims or comparisons are supported.
No section, table, or figure supplies the promised assessment; the central claim that pretraining overhead must be justified therefore rests on an unexecuted plan rather than evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review. The comments correctly identify that the manuscript as presented motivates the three research questions but does not supply the promised empirical assessment, methods, datasets, or results.

Referee: [Abstract] Abstract: the manuscript states it will 'investigate three key questions' but provides no methods, datasets, fine-tuning tasks, baselines, metrics, or results. Without these elements it is impossible to determine whether any performance claims or comparisons are supported.

Authors: We agree that the abstract is high-level and does not include these details. The manuscript will be revised either to reframe the work explicitly as a position piece outlining open questions for the community or to incorporate a summary of methods (e.g., specific fine-tuning tasks on genomics benchmarks, baselines such as ConvNova and DNABERT2, metrics, and BPE ablation results) if the assessment is completed. revision: yes
Referee: [—] No section, table, or figure supplies the promised assessment; the central claim that pretraining overhead must be justified therefore rests on an unexecuted plan rather than evidence.

Authors: We acknowledge that no section, table, or figure in the current manuscript provides the assessment or supporting evidence. The central claim therefore cannot be substantiated as written. The manuscript will be revised to remove the implication that the assessment has been executed or to add the required experimental sections, tables, and figures comparing transformer pretraining gains against convolutional baselines and evaluating BPE tokenization. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical benchmarking study that poses three explicit research questions about pretraining overhead, contribution of pretraining, and BPE tokenization impact on genomics fine-tuning tasks. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. The central argument is a call for controlled comparisons motivated by external costs and community debate, with no load-bearing step that reduces to its own inputs by construction. Self-citations, if present, are not required for the assessment to hold, and the work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are referenced in the abstract; the work is purely evaluative.


Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages · 1 internal anchor

[1] Y. Bo, W. Mao, Y. Shao, W. Bai, P. Ye, X. Ma, J. Zhao, H. Chen, and C. Shen. Revisiting convolution architecture in the realm of DNA foundation models. arXiv preprint arXiv:2502.18538, 2025.

[2] H. Dalla Torre, L. Gonzalez, J. Mendoza Revilla, N. Lopez Carranza, A. H. Grzywaczewski, F. Oteri, C. Dallago, E. Trop, B. P. de Almeida, H. Sirelkhatim, et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nature Methods, 22(2):287–297, 2025.

[3] D. R. Kelley, J. Snoek, and J. L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7):990–999, 2016.

[4] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[5] E. Nguyen, M. Poli, M. G. Durrant, B. Kang, D. Katrekar, D. B. Li, L. J. Bartie, A. W. Thomas, S. H. King, G. Brixi, et al. Sequence modeling and design from molecular to genome scale with Evo. Science, 386(6723):eado9336, 2024.

[6] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[7] E. Routhier and J. Mozziconacci. Genomics enters the deep learning era. PeerJ, 10:e13613, 2022.

[8] Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa. Byte pair encoding: A text compression scheme that accelerates pattern matching. 1999.

[9] L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.

[10] Z. Tang, N. Somia, Y. Yu, and P. K. Koo. Evaluating the representational power of pre-trained DNA language models for regulatory genomics. Genome Biology, 26(1):203, 2025.

[11] X. Wu, D. Hong, and J. Chanussot. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Transactions on Image Processing, 32:364–376, 2022.

[12] J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10):931–934, 2015.

[13] Z. Zhou, Y. Ji, W. Li, P. Dutta, R. Davuluri, and H. Liu. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.
