arxiv: 2604.15520 · v1 · submitted 2026-04-16 · 🧬 q-bio.QM

Sparse regression, classification, and microbial network estimation in QIIME2 with q2-classo and q2-gglasso

Oleg Vlasovets , Fabian Schaipp , Leo Simpson , Evan Bolyen , J. Gregory Caporaso , Christian L. Mueller

Pith reviewed 2026-05-10 09:03 UTC · model grok-4.3

classification 🧬 q-bio.QM

keywords: QIIME 2, microbiome, compositional data, sparse regression, graphical models, log-contrast models, microbial networks, penalized regression

The pith

q2-classo and q2-gglasso implement sparse log-contrast regression, classification, and latent graphical models for compositional microbiome data inside QIIME 2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies two new QIIME 2 plugins for statistical modeling of microbial count data that respect its sparse, compositional, and high-dimensional character. q2-classo fits sparse log-contrast models that predict a continuous or binary outcome from taxon abundances and supports tree-aggregated versions that incorporate phylogenetic structure. q2-gglasso estimates taxon-taxon networks via sparse graphical models, including the SPIEC-EASI framework and latent models that separate direct interactions from a low-rank component. Both plugins are demonstrated on the Atacama soil dataset, where they are reported to yield robust model selection, predictions, and networks that include covariates or latent factors. A reader would care because QIIME 2 plugins tailored to these penalized approaches are scarce, so analysts have often had to export data and risk inconsistent handling of compositionality.

Core claim

We present q2-classo and q2-gglasso, two novel QIIME 2 plugins that implement penalized regression, classification, and graphical modeling approaches for microbial compositional data. q2-classo enables the prediction of a continuous or binary outcome of interest using compositional microbiome data as predictors, with both sparse log-contrast regression and classification as well as tree-aggregated log-contrast models. q2-gglasso enables the estimation of taxon-taxon association networks through sparse graphical model estimation such as the SPIEC-EASI framework, as well as adaptive and latent graphical models; the latent model decomposes associations into a sparse direct interaction matrix and a latent (low-rank) matrix.

What carries the argument

Sparse log-contrast models for regression and classification together with latent graphical models that factor taxon-taxon associations into a sparse direct-interaction matrix and a low-rank latent matrix.
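
For orientation, the two model classes can be written in their standard textbook forms, following the cited literature (Lin et al. [20] for log-contrast regression, Chandrasekaran et al. [6] for latent graphical models). The notation here is an assumption made for illustration and is not necessarily the exact objective the plugins solve: $X$ is the $n \times p$ matrix of strictly positive compositions, $y$ the outcome, $\hat{\Sigma}$ an empirical covariance of transformed abundances, and $\lambda, \mu$ regularization parameters.

```latex
% Sparse log-contrast regression: the zero-sum constraint makes the fit
% invariant to rescaling each sample (assumed notation)
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}} \;
  \frac{1}{2n} \lVert y - \log(X)\,\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
  \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j = 0

% Latent graphical model: the observed precision matrix is split into a
% sparse part S (direct interactions) and a low-rank part L (latent factors)
(\hat{S}, \hat{L}) = \arg\min_{S,\,L} \;
  -\log\det(S - L) + \operatorname{tr}\!\big(\hat{\Sigma}\,(S - L)\big)
  + \lambda \lVert S \rVert_{1} + \mu \operatorname{tr}(L)
  \quad \text{subject to} \quad S - L \succ 0, \; L \succeq 0
```

Because $L$ is positive semidefinite, its trace equals its nuclear norm, which is what encourages low rank. Conventions differ between implementations on details such as whether the diagonal of $S$ is penalized.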

If this is right

Analysts can keep all steps of microbiome regression and network inference inside QIIME 2 without exporting data to external packages.
Tree-aggregated log-contrast models let phylogenetic relationships regularize predictions of host or environmental outcomes.
Latent graphical models allow researchers to embed samples via the low-rank component while still obtaining a sparse interpretable interaction graph.
Model selection and cross-validation routines become reproducible artifacts stored alongside the original sequencing data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Routine use of these plugins could reduce the practice of applying Euclidean or naive correlation methods to relative abundances, lowering the rate of spurious taxon associations reported in the literature.
The latent-factor decomposition may prove useful for batch-effect correction in multi-study microbiome meta-analyses.
Future extensions could couple the same log-contrast penalty with other QIIME 2 visualization or differential-abundance tools for end-to-end workflows.

Load-bearing premise

The penalized log-contrast and graphical methods, as wrapped by the plugins, correctly respect the compositional constraints of microbial count data and introduce no systematic bias beyond the shrinkage inherent to penalized estimation inside the QIIME 2 pipeline.

What would settle it

Apply the plugins to simulated compositional count tables whose true regression coefficients or direct-interaction graph are known; the recovered coefficients or edges should match the ground truth at rates substantially above chance and without systematic sign flips or spurious dense subgraphs.
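
The recovery check described above could be scored along the following lines. This is a minimal sketch, not part of either plugin: the function name `support_recovery`, the tolerance, and the example vectors are all hypothetical, and the same idea applies to edge sets of an estimated graph.

```python
def support_recovery(true_coef, est_coef, tol=1e-8):
    """Score an estimated sparse coefficient vector against a known ground truth.

    Hypothetical helper for illustration; entries with |value| <= tol count as zero.
    """
    if len(true_coef) != len(est_coef):
        raise ValueError("coefficient vectors must have the same length")
    true_support = {j for j, b in enumerate(true_coef) if abs(b) > tol}
    est_support = {j for j, b in enumerate(est_coef) if abs(b) > tol}
    hits = true_support & est_support
    # Precision: share of selected taxa that are truly active.
    precision = len(hits) / len(est_support) if est_support else 1.0
    # Recall: share of truly active taxa that were selected.
    recall = len(hits) / len(true_support) if true_support else 1.0
    # Sign flips: correctly selected taxa whose estimated sign is wrong.
    sign_flips = sum(1 for j in hits if true_coef[j] * est_coef[j] < 0)
    return {"precision": precision, "recall": recall, "sign_flips": sign_flips}


# Made-up example: both vectors sum to zero, as a log-contrast model requires.
true_beta = [1.0, -1.0, 0.0, 0.0, 0.5, -0.5]
est_beta = [0.8, -0.7, 0.0, 0.1, -0.2, 0.0]
report = support_recovery(true_beta, est_beta)
print(report)
```

In a real study these scores would be averaged over many simulated datasets and compared with a chance baseline at the same sparsity level.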

Figures

Figures reproduced from arXiv: 2604.15520 by Christian L. Mueller, Evan Bolyen, Fabian Schaipp, J. Gregory Caporaso, Leo Simpson, Oleg Vlasovets.

**Figure 1.** QIIME 2 workflow and its high-dimensional statistics extension. The left panel illustrates typical amplicon processing steps in QIIME 2. Feature Table, Taxonomy objects and available Metadata serve as input to the high-dimensional statistics plugins q2-classo and q2-gglasso (right).

Original abstract

Motivation: Statistical analysis of microbial count data derived from 16S rRNA or metagenomics sequencing poses unique challenges due to the sparse, compositional, and high-dimensional nature of the data. While QIIME 2 already provides many tools for data pre-processing and analysis, plugins for statistical regression, classification, and microbial network estimation tailored to compositional count data are relatively scarce.

Results: We present q2-classo and q2-gglasso, two novel QIIME 2 plugins that implement penalized regression, classification, and graphical modeling approaches for microbial compositional data. q2-classo enables the prediction of a continuous or binary outcome of interest using compositional microbiome data as predictors. Both sparse log-contrast regression and classification, as well as tree-aggregated log-contrast models, are available. q2-gglasso enables the estimation of taxon-taxon association networks through sparse graphical model estimation, such as the SPIEC-EASI framework, as well as adaptive and latent graphical models. The latent model can decompose taxon-taxon associations into a sparse direct interaction matrix and a latent (low-rank) matrix, which enables robust principal component embedding of a data set. Within the QIIME 2 ecosystem we demonstrate their application on the Atacama soil microbiome dataset, illustrating robust model selection, classification, and microbial network estimation with covariates and latent factors.

Availability: The software is freely available under the BSD-3-Clause License. Source code is available at https://github.com/bio-datascience/q2-gglasso and https://github.com/bio-datascience/q2-classo-latest, with installation through QIIME 2 and Docker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

q2-classo and q2-gglasso provide convenient QIIME 2 access to established sparse compositional methods, but the manuscript lacks validation against reference implementations.

Two new QIIME 2 plugins wrap existing methods for sparse regression and graphical modeling of compositional microbiome data, providing a convenient interface but without fresh algorithmic work or detailed benchmarking. The plugins implement sparse log-contrast models for prediction and classification, tree-aggregated versions, and several graphical lasso variants including adaptive and latent ones that separate direct interactions from low-rank effects. The paper demonstrates them on the Atacama soil dataset, showing how they can incorporate covariates and produce networks or embeddings.

This packaging is genuinely useful for the many labs that already use QIIME 2 for 16S processing. It avoids the friction of moving data between tools and keeps analyses reproducible within one platform. Code is available on GitHub, which is a plus.

Where it falls short is in validation. The description outlines the functionality and gives an application example, but there are no side-by-side tests against the original classo or gglasso libraries on identical inputs. No tables compare selected features, coefficient values, or edge sets. Without that, it's hard to confirm the wrappers correctly handle the compositional constraints like log-ratios and zero imputation. The citation pattern looks standard, drawing from the relevant prior papers on these methods.

This paper is for practitioners in microbial ecology and bioinformatics who need these specific statistical tools inside QIIME 2. A reader looking for new theory or rigorous performance benchmarks won't find much here, but someone wanting ready-to-use plugins will. I think it deserves peer review. The contribution is practical software integration rather than a methodological breakthrough, but that's a legitimate category for this kind of journal, and the open code allows referees to inspect the implementation directly.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces two QIIME 2 plugins, q2-classo and q2-gglasso, that implement sparse log-contrast regression and classification (including tree-aggregated variants) as well as sparse graphical modeling approaches such as SPIEC-EASI, adaptive, and latent graphical models for compositional microbial count data. These are demonstrated via an application to the Atacama soil microbiome dataset illustrating model selection, classification, and network estimation with covariates and latent factors.

Significance. If the implementations are shown to be faithful, the plugins would address a gap in the QIIME 2 ecosystem by making established penalized regression and graphical modeling tools for compositional data more accessible and integrated with existing microbiome workflows, supporting reproducible analyses of associations and predictions in high-dimensional sparse count data.

major comments (2)

[Results/Demonstration] The demonstration on the Atacama dataset (described in the abstract and results) provides only qualitative illustration of usage and does not include quantitative validation metrics, coefficient comparisons, selected taxa/edge sets, or performance benchmarks against the reference classo, gglasso, or SPIEC-EASI implementations on identical input tables and hyperparameters.
No verification is presented that the plugins correctly replicate key compositional modeling steps such as CLR transformation, zero-handling, tree aggregation, or penalty scaling, which are load-bearing for the claim that the tools respect the sparse, compositional, and high-dimensional properties of 16S count data without introducing systematic biases.
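
One of the steps named above, the CLR (centered log-ratio) transform with zero handling, is simple enough to check by hand. The sketch below is a generic pure-Python version for a single sample. The additive pseudocount of 0.5 is an assumption chosen for illustration and may differ from the zero-replacement strategy the plugins actually use.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample (a list of taxon counts).

    The pseudocount is an illustrative zero-handling choice, not the
    plugins' documented behavior.
    """
    shifted = [c + pseudocount for c in counts]
    if any(v <= 0 for v in shifted):
        raise ValueError("all entries must be positive after adding the pseudocount")
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log equals dividing by the geometric mean.
    return [v - mean_log for v in logs]


# A sample with a zero count: CLR values always sum to zero.
z = clr([10, 0, 5, 85])

# Scale invariance holds exactly only without a pseudocount.
a = clr([1, 2, 4], pseudocount=0.0)
b = clr([10, 20, 40], pseudocount=0.0)
```

Two properties make this a useful fidelity check: the transformed values of each sample sum to zero, and, when no pseudocount is added, multiplying a sample by a constant leaves the result unchanged. A wrapper that violates either one is mishandling the compositional step.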

minor comments (1)

[Abstract] The abstract and availability section could specify the exact QIIME 2 version compatibility, required dependencies, and provide a minimal reproducible example workflow or command-line invocation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive evaluation of the significance of our work. We address each of the major comments below and outline the revisions we will make to the manuscript.

Referee: The demonstration on the Atacama dataset (described in the abstract and results) provides only qualitative illustration of usage and does not include quantitative validation metrics, coefficient comparisons, selected taxa/edge sets, or performance benchmarks against the reference classo, gglasso, or SPIEC-EASI implementations on identical input tables and hyperparameters.

Authors: We appreciate the referee's observation. The Atacama dataset application is presented primarily as an illustrative example of plugin usage and integration within QIIME 2 workflows for model selection, classification, and network estimation. We agree that quantitative elements would strengthen the manuscript. In the revised version, we will add direct comparisons of coefficient values, selected taxa/edge sets, and any relevant performance metrics by executing the original classo, gglasso, and SPIEC-EASI implementations on the identical Atacama input tables and hyperparameters, with results incorporated into the Results section. revision: yes

Referee: No verification is presented that the plugins correctly replicate key compositional modeling steps such as CLR transformation, zero-handling, tree aggregation, or penalty scaling, which are load-bearing for the claim that the tools respect the sparse, compositional, and high-dimensional properties of 16S count data without introducing systematic biases.

Authors: We acknowledge that the current manuscript does not include explicit verification of these steps. In the revision, we will add a dedicated verification analysis (as a new Results subsection or supplementary material) that compares outputs from q2-classo and q2-gglasso against the reference implementations for CLR transformation, zero-handling, tree aggregation, and penalty scaling. This will use the Atacama dataset or simulated compositional count data to confirm fidelity and absence of systematic biases. revision: yes
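
A comparison of the kind promised here could be summarized with an edge-set overlap and a largest entrywise difference between two estimated precision matrices. The sketch below is hypothetical throughout: `edge_set`, `compare_estimates`, the tolerance, and the example matrices are illustrative and belong to neither plugin nor to the reference libraries.

```python
def edge_set(matrix, tol=1e-8):
    """Undirected edges (i, j), i < j, whose off-diagonal entry exceeds tol."""
    p = len(matrix)
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(matrix[i][j]) > tol}


def compare_estimates(wrapper, reference, tol=1e-8):
    """Compare two square precision-matrix estimates given as nested lists."""
    if len(wrapper) != len(reference):
        raise ValueError("matrices must have the same dimension")
    e_w, e_r = edge_set(wrapper, tol), edge_set(reference, tol)
    union = e_w | e_r
    # Jaccard index of the edge sets: 1.0 means identical graphs.
    jaccard = len(e_w & e_r) / len(union) if union else 1.0
    # Largest absolute difference over all matrix entries.
    max_abs_diff = max(abs(a - b)
                       for row_w, row_r in zip(wrapper, reference)
                       for a, b in zip(row_w, row_r))
    return {
        "jaccard": jaccard,
        "max_abs_diff": max_abs_diff,
        "only_wrapper": sorted(e_w - e_r),
        "only_reference": sorted(e_r - e_w),
    }


# Made-up 3x3 estimates: the reference has one extra weak edge (0, 2).
wrapper_est = [[1.0, 0.2, 0.0], [0.2, 1.0, -0.3], [0.0, -0.3, 1.0]]
reference_est = [[1.0, 0.2, 0.1], [0.2, 1.0, -0.3], [0.1, -0.3, 1.0]]
result = compare_estimates(wrapper_est, reference_est)
print(result)
```

With identical inputs and hyperparameters, a faithful wrapper should give a Jaccard index of 1 and a maximum difference at the level of solver tolerance. Anything larger points to a discrepancy in preprocessing or penalty scaling.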

Circularity Check

0 steps flagged

No circularity: software wrapper paper with no new derivations

The paper presents two QIIME 2 plugins that implement and expose pre-existing penalized regression, classification, and graphical modeling methods (sparse log-contrast models, SPIEC-EASI-style networks, latent graphical models) for compositional microbiome data. The abstract and description focus on availability, installation, and an illustrative application to the Atacama dataset; no original equations, uniqueness theorems, or predictions are derived. All core statistical components are imported from external frameworks (classo, gglasso, etc.), so no load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain. The derivation chain is empty by design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software implementation paper that applies established statistical methods without introducing new free parameters, axioms, or invented entities beyond the existing frameworks it packages.

Reference graph

Works this paper leans on

34 extracted references

[1] John Aitchison and John Bacon-Shone. Log contrast models for experiments with mixtures. Biometrika, 71(2):323–330, 1984.
[2] Jacob Bien, Xiaohan Yan, Léo Simpson, and Christian L Müller. Tree-aggregated predictive modeling of microbiome data. Scientific Reports, 11(1):14505, 2021.
[3] Nicholas A Bokulich, Matthew R Dillon, Evan Bolyen, Benjamin D Kaehler, Gavin A Huttley, and J Gregory Caporaso. q2-sample-classifier: machine-learning tools for microbiome classification and regression. Journal of open research software, 3(30), 2018.
[4] Evan Bolyen, Jai Ram Rideout, Matthew R Dillon, Nicholas A Bokulich, Christian C Abnet, Gabriel A Al-Ghalith, Harriet Alexander, Eric J Alm, Manimozhiyan Arumugam, Francesco Asnicar, et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature biotechnology, 37(8):852–857, 2019.
[5] Benjamin J Callahan, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. DADA2: High-resolution sample inference from Illumina amplicon data. Nature methods, 13(7):581–583, 2016.
[6] Venkat Chandrasekaran, Pablo A Parrilo, and Alan S Willsky. Latent variable graphical model selection via convex optimization. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1610–1613. IEEE, 2010.
[7] Patrick L Combettes and Christian L Müller. Regression models for compositional data: General log-contrast formulations, proximal optimization, and microbiome data applications. Statistics in Biosciences, 13(2):217–242, 2021.
[8] Patrick Danaher, Pei Wang, and Daniela M Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(2):373–397, 2014.
[9] Arthur P Dempster. Covariance selection. Biometrics, pages 157–175, 1972.
[10] Huaying Fang, Chengcheng Huang, Hongyu Zhao, and Minghua Deng. gCoda: conditional dependence network inference for compositional data. Journal of Computational Biology, 24(7):699–708, 2017.
[11] Marcus W Fedarko, Cameron Martino, James T Morton, Antonio González, Gibraan Rahman, Clarisse A Marotz, Jeremiah J Minich, Eric E Allen, and Rob Knight. Visualizing 'omic feature rankings and log-ratios using Qurro. NAR genomics and bioinformatics, 2(2):lqaa023, 2020.
[12] Andrew D Fernandes, Jean M Macklaim, Thomas G Linn, Gregor Reid, and Gregory B Gloor. ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-seq. PLoS ONE, 8(7):e67019, 2013.
[13] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[14] Jonathan Friedman and Eric J Alm. Inferring correlation networks from genomic survey data. 2012.
[15] Gregory B Gloor, Jean M Macklaim, Vera Pawlowsky-Glahn, and Juan J Egozcue. Microbiome datasets are compositional: and this is not optional. Frontiers in microbiology, 8:2224, 2017.
[16] Lingjing Jiang, Niina Haiminen, Anna-Paola Carrieri, Shi Huang, Yoshiki Vázquez-Baeza, Laxmi Parida, Ho-Cheol Kim, Austin D Swafford, Rob Knight, and Loki Natarajan. Utilizing stability criteria in choosing feature selection methods yields reproducible results in microbiome data. Biometrics, 78(3):1155–1167, 2022.
[17] Zachary D Kurtz, Richard Bonneau, and Christian L Müller. Disentangling microbial associations from hidden environmental and technical factors via latent graphical models. bioRxiv, pages 2019–12, 2019.
[18] Zachary D Kurtz, Christian L Müller, Emily R Miraldi, Dan R Littman, Martin J Blaser, and Richard A Bonneau. Sparse and compositionally robust inference of microbial ecological networks. PLoS computational biology, 11(5):e1004226, 2015.
[19] Huang Lin and Shyamal Das Peddada. Analysis of compositions of microbiomes with bias correction. Nature communications, 11(1):3514, 2020.
[20] Wei Lin, Pixu Shi, Rui Feng, and Hongzhe Li. Variable selection in regression with compositional covariates. Biometrika, 101(4):785–797, 2014.
[21] Catherine Lozupone, Manuel E Lladser, Dan Knights, Jesse Stombaugh, and Rob Knight. UniFrac: an effective distance metric for microbial community comparison. The ISME journal, 5(2):169–172, 2011.
[22] Cameron Martino, James T Morton, Clarisse A Marotz, Luke R Thompson, Anupriya Tripathi, Rob Knight, and Karsten Zengler. A novel sparse compositional technique reveals microbial perturbations. mSystems, 4(1):10–1128, 2019.
[23] Daniel McDonald, Yoshiki Vázquez-Baeza, David Koslicki, Jason McClelland, Nicolai Reeve, Zhenjiang Xu, Antonio Gonzalez, and Rob Knight. Striped UniFrac: enabling microbiome analysis at unprecedented scale. Nature methods, 15(11):847–848, 2018.
[24] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4):417–473, 2010.
[25] Aditya Mishra and Christian L Müller. Robust regression with compositional covariates. Computational Statistics & Data Analysis, 165:107315, 2022.
[26] James T Morton, Clarisse Marotz, Alex Washburne, Justin Silverman, Livia S Zaramela, Anna Edlund, Karsten Zengler, and Rob Knight. Establishing microbial composition measurement standards with reference frames. Nature communications, 10(1):2719, 2019.
[27] Julia W Neilson, Katy Califf, Cesar Cardona, Audrey Copeland, Will Van Treuren, Karen L Josephson, Rob Knight, Jack A Gilbert, Jay Quade, J Gregory Caporaso, et al. Significant impacts of increasing aridity on the arid soil microbiome. mSystems, 2(3):e00195–16, 2017.
[28] Stefanie Peschel, Christian L Müller, Erika Von Mutius, Anne-Laure Boulesteix, and Martin Depner. NetCoMi: network construction and comparison for microbiome data in R. Briefings in bioinformatics, 22(4):bbaa290, 2021.
[29] Isaiah Raspet, Elizabeth Gehret, Chloe Herman, Jeff Meilander, Andrew Manley, Anthony Simard, Evan Bolyen, and J Gregory Caporaso. Facilitating bootstrapped and rarefaction-based microbiome diversity analysis with q2-boots. F1000Research, 14:87, 2025.
[30] Fabian Schaipp, Oleg Vlasovets, and Christian L. Müller. GGLasso - a Python package for general graphical lasso computation. Journal of Open Source Software, 6(68):3865, 2021.
[31] Michael Shaffer, Kumar Thurimella, John D Sterrett, and Catherine A Lozupone. SCNIC: Sparse correlation network investigation for compositional data. Molecular Ecology Resources, 23(1):312–325, 2023.
[32] Pixu Shi, Anru Zhang, and Hongzhe Li. Regression analysis for microbiome compositional data. The Annals of Applied Statistics, 10(2):1019–1040, 2016.
[33] Léo Simpson, Patrick L. Combettes, and Christian L. Müller. c-lasso - a Python package for constrained sparse and robust regression and classification. Journal of Open Source Software, 6(57):2844, 2021.
[34] Grace Yoon, Irina Gaynanova, and Christian L Müller. Microbial networks in SPRING - semi-parametric rank-based correlation and partial correlation estimation for quantitative microbiome data. Frontiers in genetics, 10:516, 2019.