Fast and Low-Cost Genomic Foundation Models via Outlier Removal

Chenghao Qiu; Guo Ye; Han Liu; Haozheng Luo; Jerry Yao-Chieh Hu; Maojiang Su; Zhihan Zhou; Zoe Mehta

arXiv: 2505.00598 · v2 · submitted 2025-05-01 · cs.LG · cs.AI

Haozheng Luo, Chenghao Qiu, Maojiang Su, Zhihan Zhou, Zoe Mehta, Guo Ye, Jerry Yao-Chieh Hu, Han Liu

Pith reviewed 2026-05-22 16:57 UTC · model grok-4.3

keywords: genomic foundation models, outlier removal, low-rank adaptation, post-training quantization, computational efficiency, continual learning, DNABERT

The pith

Removing outliers during training lets genomic models adapt faster and quantize better with little accuracy cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GERM as a genomic foundation model built by stripping out extreme activation values that slow low-rank adaptation and damage post-training quantization. It swaps the usual attention layer for an outlier-free version drawn from associative memory ideas and applies the cleaning step in both pre-training and fine-tuning. A companion strategy called GERM-T adds small-step continual learning from saved checkpoints so new models do not have to start over. The result is claimed to cut fine-tuning time, lower memory use, and raise quantized performance while keeping losses inside acceptable bounds. If the method works as described, genomic models become practical on modest hardware instead of requiring large GPU clusters.

Core claim

GERM improves on models like DNABERT-2 by eliminating outliers that hinder low-rank adaptation and post-training quantization. It replaces the vanilla attention layer with an outlier-free mechanism, removes outliers in both pre-training and fine-tuning phases, and adds GERM-T continual learning from original checkpoints. The paper reports that this improves fine-tuning performance by 37.98 percent and quantization performance by 64.34 percent over the baseline, while cutting average kurtosis by 92.14 percent and maximum infinity norm by 82.77 percent.
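The two outlier statistics quoted here can be made concrete with a small sketch. It assumes that kurtosis means the fourth standardized moment and that infinity norm means the largest absolute activation; the toy vectors and helper names are illustrative and are not taken from the GERM code.

```python
def kurtosis(values):
    """Pearson kurtosis (fourth standardized moment); about 3 for a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2)


def infinity_norm(values):
    """Largest absolute entry of the activation vector."""
    return max(abs(v) for v in values)


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Toy activations (hypothetical): identical except for one injected extreme value.
clean = [(-1.0) ** i * (1.0 + 0.1 * (i % 5)) for i in range(64)]
spiky = clean[:-1] + [40.0]

print("kurtosis  clean / spiky:", kurtosis(clean), kurtosis(spiky))
print("inf-norm  clean / spiky:", infinity_norm(clean), infinity_norm(spiky))
print("inf-norm reduction (%):",
      percent_reduction(infinity_norm(spiky), infinity_norm(clean)))
```

A single extreme entry is enough to inflate both statistics, which is why they serve as compact indicators of how outlier-heavy a layer's activations are.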

What carries the argument

An outlier-free attention mechanism that replaces the standard attention layer to block high-magnitude activations from interfering with adaptation and quantization.
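This page does not give the formula for the outlier-free attention layer. One form discussed in the outlier-efficient associative-memory literature is a softmax with an extra 1 in the denominator, which lets a head assign almost no total weight instead of being forced to attend somewhere. The sketch below assumes that form; whether GERM uses exactly this normalization is an assumption, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_1(scores):
    """Assumed outlier-free variant: exp(s_i) / (1 + sum_j exp(s_j)).

    Computed stably by treating the extra 1 as an implicit logit of 0.
    """
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]


# When no key is relevant (all scores very negative), standard softmax still
# spreads a full unit of attention; the variant lets the total shrink toward 0.
irrelevant = [-10.0, -10.0, -10.0]
print(sum(softmax(irrelevant)), sum(softmax_1(irrelevant)))
```

The design intuition is that a head with nothing useful to attend to no longer needs very large pre-softmax values to approximate a "no-op", which is one proposed source of activation outliers.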

If this is right

LoRA fine-tuning on new genomic tasks completes in fewer steps and with lower memory use.
Quantized versions of the model retain higher accuracy on edge or low-power devices.
GERM-T allows incremental updates to existing checkpoints without full retraining from scratch.
The cleaned models remain competitive with or exceed current leading genomic foundation models on standard benchmarks.
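For readers unfamiliar with LoRA, the sketch below shows the low-rank update it trains and why its memory footprint is small. It is a generic illustration in plain Python, not GERM's implementation; the matrix sizes and the helper names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


def trainable_params(d_out, d_in, rank):
    """Parameters in the adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Tiny hypothetical layer: 4 inputs, 3 outputs, rank-1 adapter.
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5, 0.5]]        # rank x d_in
B = [[0.0], [0.0], [0.0]]         # d_out x rank, zero-initialised
x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(W, A, B, x)      # equals W x while B is still zero

# Adapter size versus a full weight matrix for an illustrative 768 x 768 layer.
print(trainable_params(768, 768, 8), "adapter params vs", 768 * 768, "full params")
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly, and fine-tuning only has to learn a small correction.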

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same outlier-cleaning step could be tested on non-genomic transformers to see whether adaptation and quantization benefits transfer.
If outlier removal proves general, it offers a lightweight alternative to full model redesign for any resource-constrained domain.
Continual learning from checkpoints may reduce the environmental cost of repeatedly training large genomic models from random initialization.

Load-bearing premise

That the measured drops in kurtosis and infinity norm are what directly produce the reported gains in adaptation speed and quantization robustness.

What would settle it

Train an otherwise identical model while deliberately preserving or reintroducing the same outliers and check whether the speed and quantization advantages disappear.
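A miniature version of this test can be run on a single tensor: quantize the same values with and without one injected outlier and compare the error on the shared entries. The sketch assumes symmetric per-tensor absmax int8 quantization, which may differ from the schemes evaluated in the paper; all values are synthetic.

```python
def quantize_dequantize_int8(values):
    """Symmetric per-tensor absmax quantization to int8 and back."""
    scale = max(abs(v) for v in values) / 127.0
    restored = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # clamp to the int8 range
        restored.append(q * scale)
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic activations in [-0.5, 0.49], then the same tensor plus one outlier.
base = [0.01 * i - 0.5 for i in range(100)]
with_outlier = base + [60.0]

err_clean = mean_abs_error(base, quantize_dequantize_int8(base))
restored = quantize_dequantize_int8(with_outlier)
err_outlier = mean_abs_error(base, restored[:100])  # error on the same 100 entries

print("error without outlier:", err_clean)
print("error with outlier:   ", err_outlier)
```

The outlier stretches the quantization scale, so the ordinary values collapse onto a handful of integer levels. This shows the mechanism only; it does not establish that the same effect explains the paper's reported gains.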

Original abstract

To address the challenge of scarce computational resources in genomic modeling, we introduce GERM, a genomic foundation model with strong compression performance and fast adaptability. GERM improves upon models like DNABERT-2 by eliminating outliers that hinder low-rank adaptation and post-training quantization, enhancing both efficiency and robustness. We replace the vanilla attention layer with an outlier-free mechanism inspired by associative memory models. By removing outliers during both pre-training and fine-tuning, this approach accelerates adaptation, reduces computational costs, and enhances quantization robustness within acceptable loss margins. Additionally, we propose GERM-T, a strategy that employs small-step continual learning within the outlier-free framework, leveraging original checkpoints to avoid retraining from scratch. Empirically, GERM improves fine-tuning performance by 37.98% and quantization by 64.34% over the baseline model. It also reduces average kurtosis by 92.14% and maximum infinity norm by 82.77%. Compared to leading methods, GERM consistently delivers superior performance, offering a practical solution for genomic modeling in resource-constrained settings. Code is available at https://github.com/MAGICS-LAB/GERM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GERM applies outlier removal plus a new attention layer to genomic models and reports solid efficiency numbers, but the design makes it hard to credit the gains specifically to outlier removal.

GERM shows that stripping outliers during pre-training and fine-tuning can speed up adaptation and improve quantization for genomic models, but the gains are hard to pin on outlier removal alone because the attention layer was changed too. The authors replace the standard attention with a layer inspired by associative memory models that avoids outliers by design. They then add GERM-T, which does small-step continual learning from checkpoints so you don't start over. The reported results are a 37.98% boost in fine-tuning performance and 64.34% in quantization, plus sharp reductions in kurtosis and infinity norm. Code is on GitHub, which is good.

This is a solid application of existing compression tricks to the genomic setting. The continual learning wrapper is a nice touch for practical use.

The main concern is the lack of clean ablations. If the new attention mechanism is doing much of the heavy lifting on outlier control, then just removing outliers from a vanilla model might not give the same lift. I'd like to see runs that hold the architecture fixed and toggle only the removal step. Also check whether the outlier threshold was tuned on validation data and whether multiple seeds were averaged.

For readers building resource-light models in genomics or similar sequence tasks, this gives a concrete starting point. It is worth a serious review because the ideas are straightforward to test and the claims are empirical rather than theoretically overreaching. Recommend sending it to referees with notes on adding those controls.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GERM, a genomic foundation model that removes outliers during both pre-training and fine-tuning to improve low-rank adaptation and post-training quantization. It replaces the vanilla attention layer with an outlier-free mechanism inspired by associative memory models. The authors also propose GERM-T, a continual learning strategy that uses small-step updates from original checkpoints. Empirically, GERM is reported to deliver 37.98% better fine-tuning performance and 64.34% better quantization than baselines such as DNABERT-2, together with 92.14% lower average kurtosis and 82.77% lower maximum infinity norm.

Significance. If the reported gains can be attributed specifically to outlier removal after proper controls, the work offers a practical route to compressing and adapting genomic foundation models under limited compute. The code release supports reproducibility and the focus on resource-constrained genomic applications is timely.

major comments (2)

[Method / Experimental setup] The central empirical claims attribute the 37.98% fine-tuning and 64.34% quantization gains, as well as the kurtosis and infinity-norm reductions, to outlier removal. However, the method also replaces vanilla attention with a new associative-memory-inspired outlier-free layer. No ablation is described that holds architecture and data fixed while varying only the outlier removal step; therefore the observed metric improvements cannot yet be confidently ascribed to outlier removal rather than the attention change or other pipeline adjustments.
[Experiments] The experimental section does not state whether baselines were matched for total compute, whether the outlier threshold was tuned on a held-out validation set rather than test data, or whether results are reported over multiple random seeds. These controls are load-bearing for interpreting the precise percentage gains as robust evidence of accelerated adaptation and quantization robustness.
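The controls requested in these two comments amount to a small factorial design. The sketch below enumerates such a grid and pairs up runs that differ in exactly one factor. The factor names, the assumption that outlier removal can be toggled independently of the attention layer, and the seed list are hypothetical; no results are implied.

```python
from itertools import product

# Hypothetical factors for a 2 x 2 ablation, each repeated over several seeds.
FACTORS = {
    "attention": ["vanilla", "outlier_free"],
    "outlier_removal": [False, True],
}
SEEDS = [0, 1, 2]


def build_runs(factors, seeds):
    """One run config per factor combination and seed."""
    names = list(factors)
    runs = []
    for combo in product(*(factors[n] for n in names)):
        for seed in seeds:
            run = dict(zip(names, combo))
            run["seed"] = seed
            runs.append(run)
    return runs


def controlled_pairs(runs, factor):
    """Pairs (off, on) of runs that differ only in `factor`."""
    low, high = FACTORS[factor]
    pairs = []
    for a in runs:
        for b in runs:
            same_rest = all(a[k] == b[k] for k in a if k != factor)
            if same_rest and a[factor] == low and b[factor] == high:
                pairs.append((a, b))
    return pairs


runs = build_runs(FACTORS, SEEDS)
pairs = controlled_pairs(runs, "outlier_removal")
print(len(runs), "runs,", len(pairs), "controlled comparisons")
```

Comparing metrics only within such pairs is what would let a gain be attributed to the removal step rather than to the attention change.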

minor comments (1)

[Abstract] The abstract refers to improvements 'within acceptable loss margins' without quantifying the margins or the accuracy-compression trade-off.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address the major concerns regarding the isolation of outlier removal effects and the experimental controls in our point-by-point responses below. We believe these clarifications and planned revisions will strengthen the manuscript.

Referee: [Method / Experimental setup] The central empirical claims attribute the 37.98% fine-tuning and 64.34% quantization gains, as well as the kurtosis and infinity-norm reductions, to outlier removal. However, the method also replaces vanilla attention with a new associative-memory-inspired outlier-free layer. No ablation is described that holds architecture and data fixed while varying only the outlier removal step; therefore the observed metric improvements cannot yet be confidently ascribed to outlier removal rather than the attention change or other pipeline adjustments.

Authors: We agree that an ablation isolating the contribution of outlier removal from the architectural change would provide stronger evidence. In the original design, the outlier-free attention mechanism is specifically engineered to operate without outliers by incorporating removal during pre-training and fine-tuning. To address this, we will include a new ablation study in the revised manuscript. This ablation will compare the associative-memory-inspired layer with and without the outlier removal component, while keeping all other factors (architecture, data, training procedure) fixed. We expect this to demonstrate that the outlier removal is the key driver of the reported improvements in fine-tuning speed and quantization robustness.
revision: yes
Referee: [Experiments] The experimental section does not state whether baselines were matched for total compute, whether the outlier threshold was tuned on a held-out validation set rather than test data, or whether results are reported over multiple random seeds. These controls are load-bearing for interpreting the precise percentage gains as robust evidence of accelerated adaptation and quantization robustness.

Authors: We appreciate this point on experimental rigor. In our experiments, all baseline models including DNABERT-2 were trained and fine-tuned using the same total compute budget, with matched training steps, batch sizes, and hardware. The outlier threshold was determined using a held-out validation set to prevent any leakage from the test data. Furthermore, all performance metrics are reported as averages over three independent random seeds, with standard deviations provided in the supplementary material. We will update the experimental section and add a dedicated paragraph on experimental controls in the revised version to explicitly document these details.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of method definition

The paper defines GERM via explicit steps: outlier removal in pre-training/fine-tuning plus replacement of vanilla attention by an associative-memory-inspired outlier-free layer. Reported gains (37.98% fine-tuning improvement, 64.34% quantization improvement, 92.14% kurtosis reduction, 82.77% inf-norm reduction) are presented strictly as measured outcomes on benchmarks versus baselines. No equation, prediction, or first-principles claim is shown that reduces by construction to a fitted parameter, self-citation chain, or renamed input; the performance numbers remain externally falsifiable quantities obtained after the method is applied. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach rests on the empirical observation that outliers in attention activations are the primary bottleneck for LoRA and quantization in genomic models; no new mathematical axioms are introduced, only a modeling choice to suppress them.

free parameter (1)

outlier threshold
Value chosen to decide which activations count as outliers; not stated as a fixed constant in the abstract.
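Since the threshold rule is not stated, the sketch below shows one common convention purely for illustration: flag activations more than k standard deviations from the mean. The k-sigma rule, the default k = 6, and the synthetic data are all assumptions, not the paper's definition.

```python
from statistics import mean, pstdev


def outlier_indices(values, k=6.0):
    """Indices whose value lies more than k population std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


# Synthetic activations in [-0.3, 0.3] with one injected extreme value.
acts = [0.1 * ((i % 7) - 3) for i in range(200)]
acts[42] = 25.0

print(outlier_indices(acts))         # indices flagged at the default k
print(outlier_indices(acts, k=20))   # a stricter threshold flags fewer entries
```

The count of flagged activations depends directly on k, which is why the ledger treats the threshold as a free parameter that should be fixed on validation data rather than test data.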

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

GERM improves fine-tuning performance by 37.98% and quantization by 64.34%... reduces average kurtosis by 92.14% and maximum infinity norm by 82.77%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.