DNABERT-2: Efficient foundation model and benchmark for multi-species genome

DNABERT-2: Efficient foundation model and benchmark for multi-species genome · 2026 · arXiv 2306.15006

9 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

cs.CY · 2026-05-14 · unverdicted · novelty 7.0

A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.

How Post-Training Shapes Biological Reasoning Models

cs.LG · 2026-06-15 · unverdicted · novelty 6.0

Post-training stages reshape generalization in biological reasoning models distinctly: CPT aligns with biological language, SFT boosts ID performance but causes OOD to peak early and decline, while RL on strong SFT checkpoints can recover OOD generalization.

Flexible Flows for Biological Sequence Design

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

Enhances Discrete Flow Matching with domain-specific couplings, latent edit-based rates, latent classifier-free guidance, and temperature scaling to reach SOTA on DNA and peptide sequence tasks.

Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction

q-bio.GN · 2026-06-06 · unverdicted · novelty 6.0

R3LM trains LLMs via two-stage reasoning-then-regression on a new dataset CRE-ReasonBench with mechanistic traces, achieving SOTA enhancer activity prediction across three cell types with interpretable outputs.

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering

q-bio.QM · 2026-05-29 · unverdicted · novelty 6.0

TadA-Bench supplies a chronological million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds that evaluates models on future-round variant ranking given only earlier data.

Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

q-bio.GN · 2026-05-12 · unverdicted · novelty 6.0

Set-aggregated genome embeddings from genomic language models predict microbiome abundance profiles with improved generalization to novel genomes over classical bioinformatics methods.

Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

AttnLRP explanations of DNABERT-2 reliably capture known biological patterns in genomic sequences, showing that transformer-based genome language models can yield biologically meaningful insights comparable to CNNs.

In Search of Lost DNA Sequence Pretraining

cs.LG · 2026-04-17 · unverdicted · novelty 5.0

DNA pretraining suffers from inappropriate evaluation datasets, flawed neighbor-masking, and neglected vocabulary design; the authors supply guidelines and a reproducible testbed to fix them.

DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

q-bio.GN · 2026-06-29 · unverdicted · novelty 4.0

Benchmark assessment of pretraining contribution and BPE tokenization in transformer versus convolutional DNA language models for genomics fine-tuning tasks.

citing papers explorer

Showing 9 of 9 citing papers.

GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
cs.CY · 2026-05-14 · unverdicted · none · ref 15

How Post-Training Shapes Biological Reasoning Models
cs.LG · 2026-06-15 · unverdicted · none · ref 36

Flexible Flows for Biological Sequence Design
cs.LG · 2026-06-09 · unverdicted · none · ref 9

Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction
q-bio.GN · 2026-06-06 · unverdicted · none · ref 44

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering
q-bio.QM · 2026-05-29 · unverdicted · none · ref 99

Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
q-bio.GN · 2026-05-12 · unverdicted · none · ref 16

Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2
cs.LG · 2026-04-23 · unverdicted · none · ref 31

In Search of Lost DNA Sequence Pretraining
cs.LG · 2026-04-17 · unverdicted · none · ref 40

DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks
q-bio.GN · 2026-06-29 · unverdicted · none · ref 13
