arxiv: 2605.12286 · v1 · submitted 2026-05-12 · 🧬 q-bio.GN · cs.AI

Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

Georg K. Gerber, Travis E. Gibson, Younhun Kim

Pith reviewed 2026-05-13 02:56 UTC · model grok-4.3

classification 🧬 q-bio.GN cs.AI

keywords microbiome abundance prediction · genome embeddings · genomic language models · set aggregation · metagenomics · few-shot learning · community-level representations

The pith

Aggregating embeddings from genomic language models allows prediction of microbiome abundances and improves generalization to novel genomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors ask whether the abundance profile of a microbial community can be predicted directly from the raw DNA sequences of its member genomes. They develop set-aggregated genome embeddings (SAGE), which combine representations learned by genomic language models (GLMs) into a community-level descriptor. This leverages the models' ability to generalize from limited examples. In benchmarks, the approach generalizes better to genomes not encountered in training than classical bioinformatics approaches do. Ablations further indicate that the community-level aggregation step and the intermediate transformations between latent representations both affect accuracy.

Core claim

In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and demonstrate the differences between GLM embedding choices.

What carries the argument

Set-aggregated genome embeddings (SAGE), which pool individual genome embeddings produced by genomic language models to form a single community representation used for abundance prediction.
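
The pooling idea can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes sum-pooling as the aggregation, a single hidden layer, random untrained weights, and a softmax over genomes to produce relative abundances. The names `predict_abundances` and `init_params` are hypothetical.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sum_pool(vectors):
    """Permutation-invariant aggregation: element-wise sum over a set of vectors."""
    return [sum(col) for col in zip(*vectors)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(dim, hidden, seed=0):
    """Random, untrained weights (assumption: illustration only)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]],
        "b2": [0.0],
    }

def predict_abundances(genome_embeddings, params):
    """Score each genome from its own embedding concatenated with the pooled community vector."""
    community = sum_pool(genome_embeddings)
    scores = []
    for emb in genome_embeddings:
        hidden = relu(linear(emb + community, params["w1"], params["b1"]))
        scores.append(linear(hidden, params["w2"], params["b2"])[0])
    return softmax(scores)

# Three stand-in genome embeddings of dimension 4 (random placeholders, not GLM output).
params = init_params(dim=4, hidden=8)
rng = random.Random(1)
community_embeddings = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
abundances = predict_abundances(community_embeddings, params)
```

Because the community vector is a sum, reordering the genomes only reorders the predicted abundances, which is the property that lets a community be treated as a set.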

If this is right

Abundance profiles of microbial communities can be predicted even when some genomes are new to the model.
Community-level aggregation of embeddings yields better results than processing genomes separately.
Applying intermediate transformations to the latent representations enhances predictive accuracy.
The specific genomic language model chosen affects the quality of the resulting embeddings for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach might apply to forecasting other community properties such as functional outputs or stability.
It implies that genomic language models encode transferable functional information across different microbial species.
Future work could test the method on real-world metagenomic samples with unknown compositions.

Load-bearing premise

That the embeddings generated by genomic language models contain sufficient functional information about genomes to enable accurate abundance prediction when aggregated, particularly for genomes not present in the training set.

What would settle it

A direct comparison of prediction error rates on a dataset of microbial communities where all test genomes are entirely absent from the training data; superior performance over classical methods would support the claim, while equivalent or worse performance would refute it.
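
A genome-disjoint evaluation of this kind could be set up roughly as follows. This is a sketch under stated assumptions: communities are given as sets of genome identifiers, and the function names and sample data are hypothetical rather than taken from the paper.

```python
def split_communities_by_genome(communities, held_out_genomes):
    """Send every community containing a held-out genome to the test split.

    communities: dict mapping sample id -> set of genome ids.
    held_out_genomes: genome ids reserved as "novel".
    """
    held_out = set(held_out_genomes)
    train, test = {}, {}
    for sample_id, genomes in communities.items():
        if genomes & held_out:
            test[sample_id] = set(genomes)
        else:
            train[sample_id] = set(genomes)
    return train, test

def leaked_genomes(train, held_out_genomes):
    """Held-out genomes that still occur in some training community (should be empty)."""
    seen_in_train = set().union(*train.values()) if train else set()
    return seen_in_train & set(held_out_genomes)

def fully_novel_test_samples(train, test):
    """Test communities in which no genome at all was seen during training."""
    seen_in_train = set().union(*train.values()) if train else set()
    return {sid for sid, genomes in test.items() if not (genomes & seen_in_train)}

# Made-up example data.
communities = {
    "s1": {"gA", "gB"},
    "s2": {"gB", "gC"},
    "s3": {"gD", "gE"},
    "s4": {"gA", "gD"},
}
held_out = {"gD", "gE"}
train, test = split_communities_by_genome(communities, held_out)
leaks = leaked_genomes(train, held_out)
strict_test = fully_novel_test_samples(train, test)
```

In this toy data, `s4` lands in the test split but still contains `gA`, which appears in training; only `s3` satisfies the stricter condition that every test genome is absent from training.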

Figures

Figures reproduced from arXiv: 2605.12286 by Georg K. Gerber, Travis E. Gibson, Younhun Kim.

**Figure 1.** Overview of (a) the predictive task and (b) the deep-learning approach. The primary goal is to use genomic sequences directly to produce abundance predictions per microbiome sample. Evo (Nguyen et al., 2024) and DNABERT (Ji et al., 2021) can be feature-extracted for microbiome tasks. The two prototypical tasks which have appeared in literature are environmental source prediction and host phenotype predict…

**Figure 3.** Benchmarking results on the 16S ASV dataset (top row) and the WMS SGB dataset (bottom row).

**Figure 2.** Architectural diagram of the benchmarked SAGE implementation. The architecture is divided into modules implementing successive hierarchies: microbiomes are sets of taxa, taxa are sets of genes, and genes are represented by sequences. Intermediate MLPs are broadcast across each slice to serve as latent transformations between represented spaces.

**Figure 5.** MLP and model capacity ablation analysis. Models are trained with and without the MLP transformations in the taxa-pooling and community-pooling modules, sweeping across parameter counts. Curves are local fits (LOWESS). 6. Discussion. Our results support the main hypothesis: genome embedding aggregation results in better generalization to novel sequences unseen during training. Still, the WMS dataset’s sm…

**Figure 4.** Benchmark of SAGE models for ablating community-level pooling. Models are either (a) given zero vectors for community-level representations during evaluation, or (b) are fully re-trained with community representations removed. To verify the effectiveness of including community representations, we conducted an ablation study on the community-level pooling module…
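
The hierarchy described in the Figure 2 caption (genes pooled into taxa, taxa pooled into a community, with a shared transformation broadcast at each level) can be sketched as below. This is an illustration under assumptions, not the benchmarked architecture: sum-pooling is assumed at both levels, each intermediate MLP is reduced to one dense layer with ReLU, weights are random, and the `ablate_community` flag only mimics the zero-vector variant named in the Figure 4 caption. All function names are hypothetical.

```python
import random

def dense_relu(x, weights, bias):
    """One dense layer with ReLU; stands in for an intermediate MLP."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def sum_pool(vectors):
    """Element-wise sum over a set of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]

def random_layer(n_in, n_out, rng):
    """Random, untrained weights (assumption: illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def encode_community(community, gene_layer, taxon_layer, ablate_community=False):
    """community: list of taxa; each taxon is a list of gene embedding vectors."""
    taxon_vectors = []
    for genes in community:
        # The same gene-level layer is applied to every gene, then pooled per taxon.
        transformed = [dense_relu(g, *gene_layer) for g in genes]
        taxon_vectors.append(sum_pool(transformed))
    # The same taxon-level layer is applied to every taxon, then pooled per community.
    transformed_taxa = [dense_relu(t, *taxon_layer) for t in taxon_vectors]
    community_vector = sum_pool(transformed_taxa)
    if ablate_community:
        # Zero out the community representation, as in an evaluation-time ablation.
        community_vector = [0.0] * len(community_vector)
    return taxon_vectors, community_vector

# Two taxa with 3 and 2 gene embeddings of dimension 4 (random placeholders).
rng = random.Random(0)
gene_layer = random_layer(4, 6, rng)
taxon_layer = random_layer(6, 5, rng)
community = [
    [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)],
    [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)],
]
taxon_vectors, community_vector = encode_community(community, gene_layer, taxon_layer)
_, ablated_vector = encode_community(community, gene_layer, taxon_layer, ablate_community=True)
```

Sharing one layer across all genes and one across all taxa keeps the parameter count independent of community size, so communities with different numbers of members pass through the same model.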

Original abstract

Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and demonstrate the differences between GLM embedding choices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SAGE applies set aggregation to genomic language model embeddings for microbiome abundance prediction and reports better generalization than classical methods, but the claim needs explicit confirmation that test genomes have no overlap with training communities.

The paper's core move is to treat a microbiome as a set of genome embeddings from a GLM, then aggregate them to predict relative abundances. This framing is new in the microbiome prediction space and does not reduce to prior marker-gene or alignment pipelines. The ablations are useful: they show that the community-level aggregation step drives the gains, and they compare different GLM embedding choices and intermediate transformations. Those pieces give a clear picture of where the performance comes from and make the method easier to reproduce or extend.

Referee Report

2 major / 2 minor

Summary. The paper proposes set-aggregated genome embeddings (SAGE) from genomic language models (GLMs) to predict community-level microbiome abundance profiles directly from raw DNA sequences of community members. It benchmarks this against classical bioinformatics methods to claim improved generalization on novel genomes, with ablations demonstrating benefits of community-level latent representations and intermediate transformations between embeddings and predictions.

Significance. If the generalization claim holds under strict no-leakage conditions, the work would offer a valuable few-shot approach for metagenomic prediction by leveraging pre-trained GLM embeddings at the set level, potentially reducing the need for exhaustive training data in microbiome modeling. The ablations on latent representations and transformation choices provide useful empirical grounding for the method.

major comments (2)

[Abstract and Methods] The central generalization claim requires that 'novel genomes' in test communities are entirely absent from all training communities. No details are given on the genome-level partitioning procedure, whether splits are performed at the genome, sample, or read level, or on average sequence similarity between train and test genomes. This is load-bearing, as overlap would allow the set-aggregation model to exploit identity rather than functional embedding similarity, undermining attribution of gains to the GLM approach.
[Results, benchmarks and ablations] The reported improvements lack accompanying statistical tests, error bars, or details on data splits and cross-validation strategy. Without these, it is not possible to determine whether the performance differences versus classical baselines are robust or could be explained by partial genome overlap or split leakage.

minor comments (2)

[Abstract and Introduction] The abstract and introduction would benefit from a clearer definition of 'set-aggregated' and how the aggregation operation is implemented (e.g., mean, attention, or other pooling) before the benchmarking claims.
[Methods] Notation for the SAGE model and the intermediate transformations could be introduced more explicitly with a small diagram or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below and have revised the manuscript to provide the requested clarifications and statistical details.

Referee: [Abstract and Methods] The central generalization claim requires that 'novel genomes' in test communities are entirely absent from all training communities. No details are given on the genome-level partitioning procedure, whether splits are performed at the genome, sample, or read level, or on average sequence similarity between train and test genomes. This is load-bearing, as overlap would allow the set-aggregation model to exploit identity rather than functional embedding similarity, undermining attribution of gains to the GLM approach.

Authors: We agree that explicit documentation of the partitioning strategy is necessary to support the generalization claims. In the revised manuscript, we have expanded the Methods section with a full description of the genome-level partitioning procedure. Splits were performed at the genome level such that every genome appearing in any test community is entirely absent from all training communities. We have also added the average sequence similarity between train and test genomes, which is low enough to indicate that performance gains derive from functional embedding similarity rather than sequence identity. These additions directly address the concern about potential leakage.
revision: yes

Referee: [Results, benchmarks and ablations] The reported improvements lack accompanying statistical tests, error bars, or details on data splits and cross-validation strategy. Without these, it is not possible to determine whether the performance differences versus classical baselines are robust or could be explained by partial genome overlap or split leakage.

Authors: We acknowledge that statistical tests, error bars, and explicit split details are required for rigorous evaluation. The revised manuscript now includes paired statistical tests (with p-values) comparing SAGE against the classical baselines, error bars on all benchmark figures representing standard deviation across cross-validation folds, and a complete description of the data-splitting and cross-validation protocol in the Methods section. These changes allow readers to assess the robustness of the reported improvements independently of any potential overlap issues.
revision: yes
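
For readers who want a concrete picture of what a paired comparison across folds involves, here is one possible form. It is a sketch only: the rebuttal does not say which paired test is used, so an exact sign-flip permutation test is swapped in here, and the per-fold error values are invented for illustration.

```python
from itertools import product
from statistics import mean, stdev

def fold_summary(scores):
    """Mean and sample standard deviation across folds (the usual error-bar inputs)."""
    return mean(scores), stdev(scores)

def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired permutation test on per-fold score differences.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every one of the 2**n sign assignments is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

# Invented per-fold prediction errors (lower is better); not results from the paper.
sage_errors = [0.21, 0.19, 0.23, 0.20, 0.22]
baseline_errors = [0.27, 0.24, 0.26, 0.29, 0.25]

sage_mean, sage_sd = fold_summary(sage_errors)
p_value = paired_sign_flip_test(sage_errors, baseline_errors)
```

With five folds the smallest attainable two-sided p-value is 2/32 = 0.0625, which this toy data reaches because one method wins on every fold. That floor is a reminder that a small number of folds limits how strong a paired claim can be.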

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper presents an empirical ML method that aggregates pre-trained genomic language model embeddings at the community level to predict abundances, with benchmarking against classical bioinformatics baselines and ablations on latent representations. No equations, derivations, or predictions are defined in terms of the target result itself. No self-citation chains, uniqueness theorems, or ansatzes are load-bearing for the central generalization claim. The approach is self-contained against external benchmarks and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Paper rests on standard assumptions about pre-trained genomic language models and set functions; no explicit free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Genomic language models produce embeddings that encode functional properties relevant to community-level abundance.
Implicit foundation for using GLM embeddings as input to abundance prediction.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 4 internal anchors

[1] Gut microbiota in human metabolic health and disease. Nature Reviews Microbiology, 2021.
[2] Current understanding of the human microbiome. Nature Medicine, 2018.
[3] Microbiota-mediated colonization resistance: mechanisms and regulation. Nature Reviews Microbiology, 2023.
[4] Interaction between microbiota and immunity in health and disease. Cell Research, 2020.
[5] Synthetic biology tools to engineer microbial communities for biotechnology. Trends in Biotechnology, 2019.
[6] Diversity, stability and resilience of the human gut microbiota. Nature, 2012.
[7] Culturing the human microbiota and culturomics. Nature Reviews Microbiology, 2018.
[8] Integration of 168,000 samples reveals global patterns of the human gut microbiome. Cell, 2025.
[9] Specialized metabolic functions of keystone taxa sustain soil microbiome stability. Microbiome, 2021.
[10] Recovery of the gut microbiota after antibiotics depends on host diet, community context, and environmental reservoirs. Cell Host & Microbe, 2019.
[11] Large-scale association analyses identify host factors influencing human gut microbiome composition. Nature Genetics, 2021.
[12] Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nature Biotechnology, 2023.
[13] Sequence modeling and design from molecular to genome scale with Evo. Science, 2024.
[14] DNABERT-S: Learning species-aware DNA embedding with genome foundation models. arXiv preprint arXiv:2402.08777.
[15] DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 2021.
[16] DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006.
[17] AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model. bioRxiv, 2025.
[18] Longitudinal profiling of low-abundance strains in microbiomes with ChronoStrain. Nature Microbiology, 2025.
[19] Learning ecosystem-scale dynamics from microbiome data with MDSINE2. Nature Microbiology, 2025.
[20] Physics-constrained neural ordinary differential equation models to discover and predict microbial community dynamics. Proceedings of the National Academy of Sciences, 2026.
[21] Model-free prediction of microbiome compositions. Microbiome, 2024.
[22] Genome modeling and design across all domains of life with Evo 2. bioRxiv, 2025.
[23] Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0. Nature Communications, 2020.
[24] The human microbiome project. Nature, 2007.
[25] SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing. Bioinformatics, 2025.
[26] Abundance-Aware Set Transformer for Microbiome Sample Embedding. arXiv preprint arXiv:2508.11075.
[27] A data-driven modeling framework for mapping genotypes to synthetic microbial community functions. bioRxiv.
[28] Attention is all you need. Advances in Neural Information Processing Systems.
[29] Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nature Methods, 2025.
[30] American gut: an open platform for citizen science microbiome research. mSystems, 2018.
[31] Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS ONE, 2012.
[32] Incremental learning for robust visual tracking. International Journal of Computer Vision, 2008.
[33] Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 1966.
[34] UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
[35] Unraveling metagenomics through long-read sequencing: a comprehensive review. Journal of Translational Medicine, 2024.
[36] Universality of human microbial dynamics. Nature, 2016.
[37] Virtual Cell Challenge: Toward a Turing test for the virtual cell. Cell, 2025.
[38] Gut microbiome heritability is nearly universal but environmentally contingent. Science, 2021.
[39] High-resolution metagenome assembly for modern long reads with myloasm. bioRxiv.
[40] High-quality metagenome assembly from long accurate reads with metaMDBG. Nature Biotechnology, 2024.
[41] metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 2020.
[42] Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 2009.
[43] DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 2016.
[44] MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002.
[45] Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 2014.
[46] Elements of information theory. 1999.
[47] An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 1957.
[48] Visualizing and understanding convolutional networks. European Conference on Computer Vision, 2014.
[49] Ablation studies in artificial neural networks. arXiv preprint arXiv:1901.08644.
[50] Set transformer: A framework for attention-based permutation-invariant neural networks. International Conference on Machine Learning, 2019.
[51] Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
[52] Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.
[53] Comparing genomes recovered from time-series metagenomes using long- and short-read sequencing technologies. Microbiome, 2023.
[54] Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
[55] Strain tracking reveals the determinants of bacterial engraftment in the human gut following fecal microbiota transplantation. Cell Host & Microbe, 2018.