pith. machine review for the scientific record.

arxiv: 2604.15520 · v1 · submitted 2026-04-16 · 🧬 q-bio.QM

Recognition: unknown

Sparse regression, classification, and microbial network estimation in QIIME2 with q2-classo and q2-gglasso

Pith reviewed 2026-05-10 09:03 UTC · model grok-4.3

classification 🧬 q-bio.QM
keywords QIIME 2 · microbiome · compositional data · sparse regression · graphical models · log-contrast models · microbial networks · penalized regression

The pith

q2-classo and q2-gglasso implement sparse log-contrast regression, classification, and latent graphical models for compositional microbiome data inside QIIME 2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies two new QIIME 2 plugins to perform statistical modeling on microbial count data while respecting its sparse, compositional, and high-dimensional character. q2-classo fits sparse log-contrast models that predict a continuous or binary outcome from taxon abundances and supports tree-aggregated versions that incorporate phylogenetic structure. q2-gglasso estimates taxon-taxon networks via sparse graphical models, including the SPIEC-EASI framework and latent models that separate direct interactions from a low-rank component. Both plugins are shown on the Atacama soil dataset to produce stable model selection, predictions, and networks that include covariates or latent factors. A reader would care because existing QIIME 2 tools lack these penalized approaches, so analysts have had to export data and risk inconsistent handling of compositionality.

Core claim

We present q2-classo and q2-gglasso, two novel QIIME 2 plugins that implement penalized regression, classification, and graphical modeling approaches for microbial compositional data. q2-classo enables the prediction of a continuous or binary outcome of interest using compositional microbiome data as predictors, with both sparse log-contrast regression and classification as well as tree-aggregated log-contrast models. q2-gglasso enables the estimation of taxon-taxon association networks through sparse graphical model estimation such as the SPIEC-EASI framework, as well as adaptive and latent graphical models; the latent model decomposes associations into a sparse direct interaction matrix and a latent (low-rank) matrix.

What carries the argument

Sparse log-contrast models for regression and classification together with latent graphical models that factor taxon-taxon associations into a sparse direct-interaction matrix and a low-rank latent matrix.
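
For orientation, the two model families named above are usually written as follows. This is a sketch in generic notation taken from the compositional regression and latent-variable graphical model literature, not copied from the paper: the symbols, the 1/(2n) loss scaling, and the penalty parameters \lambda, \lambda_1, \mu_1 are assumptions, and the plugins may use different conventions.

```latex
% Sparse log-contrast regression: sample i = 1..n, taxa j = 1..p,
% strictly positive abundances x_{ij}, outcome y_i.
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i,
\qquad \sum_{j=1}^{p} \beta_j = 0

\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n}
  \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j \log x_{ij} \Bigr)^{2}
  + \lambda \lVert \beta \rVert_{1}
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j = 0

% Latent graphical model: the precision matrix of the observed taxa
% is a sparse matrix S minus a low-rank positive semidefinite matrix L.
\Theta_{\mathrm{obs}} = S - L,
\qquad S \ \text{sparse}, \quad L \succeq 0 \ \text{low rank}

(\hat{S}, \hat{L}) = \arg\min_{S,\,L}\;
  -\log\det(S - L)
  + \operatorname{tr}\bigl(\hat{\Sigma}\,(S - L)\bigr)
  + \lambda_1 \lVert S \rVert_{1,\mathrm{off}}
  + \mu_1 \lVert L \rVert_{*}
\quad \text{subject to} \quad S - L \succ 0,\; L \succeq 0
```

Here \hat{\Sigma} stands for an empirical covariance of transformed abundances, \lVert S \rVert_{1,\mathrm{off}} is the sum of absolute off-diagonal entries, and \lVert L \rVert_{*} is the nuclear norm, which equals the trace for a positive semidefinite L.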

If this is right

  • Analysts can keep all steps of microbiome regression and network inference inside QIIME 2 without exporting data to external packages.
  • Tree-aggregated log-contrast models let phylogenetic relationships regularize predictions of host or environmental outcomes.
  • Latent graphical models allow researchers to embed samples via the low-rank component while still obtaining a sparse interpretable interaction graph.
  • Model selection and cross-validation routines become reproducible artifacts stored alongside the original sequencing data.
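
As a hedged illustration of the "sparse interpretable interaction graph" in the list above: in a Gaussian graphical model, a nonzero off-diagonal entry of the precision (inverse covariance) matrix is an edge, and the partial correlation between taxa i and j is -Θ_ij / sqrt(Θ_ii Θ_jj). The sketch below reads an edge list off a small made-up precision matrix. The names `partial_correlations` and `edges` and the matrix `theta` are hypothetical and not part of q2-gglasso; in the latent model, the sparse component would play the role of `theta`.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (list of lists) to partial correlations.

    Off-diagonal entry (i, j) is -theta_ij / sqrt(theta_ii * theta_jj);
    the diagonal is left at 0.0 because it carries no edge information.
    """
    p = len(precision)
    pcor = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                pcor[i][j] = -precision[i][j] / scale
    return pcor


def edges(precision, tol=1e-8):
    """List (i, j, partial_correlation) for each nonzero upper-triangle entry."""
    pcor = partial_correlations(precision)
    p = len(precision)
    return [
        (i, j, pcor[i][j])
        for i in range(p)
        for j in range(i + 1, p)
        if abs(precision[i][j]) > tol
    ]


# Made-up 3-taxon precision matrix: taxa 0 and 1 interact, taxon 2 is isolated.
theta = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
network = edges(theta)
```

A negative precision entry corresponds to a positive partial correlation, which is why the sign is flipped before the edge is reported.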

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Routine use of these plugins could reduce the practice of applying Euclidean or naive correlation methods to relative abundances, lowering the rate of spurious taxon associations reported in the literature.
  • The latent-factor decomposition may prove useful for batch-effect correction in multi-study microbiome meta-analyses.
  • Future extensions could couple the same log-contrast penalty with other QIIME 2 visualization or differential-abundance tools for end-to-end workflows.

Load-bearing premise

The penalized log-contrast and graphical methods, as wrapped inside the QIIME 2 pipeline, correctly respect the compositional constraints of microbial count data and introduce no systematic bias beyond the shrinkage that any penalized estimator applies by design.
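
One part of this premise can be checked by hand: a log-contrast predictor with coefficients that sum to zero gives the same value for raw counts and for relative abundances, because rescaling a sample by any positive constant adds that constant's logarithm times the coefficient sum, which is zero. The sketch below demonstrates this with the standard library only. The function `log_contrast_predict`, the sample counts, and the coefficient vector are all hypothetical and are not taken from q2-classo.

```python
import math


def log_contrast_predict(abundances, beta, intercept=0.0):
    """Linear predictor of a log-contrast model: intercept + sum_j beta_j * log(x_j).

    Requires strictly positive abundances and coefficients that sum to zero.
    """
    if abs(sum(beta)) > 1e-12:
        raise ValueError("log-contrast coefficients must sum to zero")
    return intercept + sum(b * math.log(x) for b, x in zip(beta, abundances))


# Hypothetical sample: strictly positive counts for four taxa
# (zeros would need a replacement strategy before taking logs).
counts = [120.0, 30.0, 5.0, 845.0]
total = sum(counts)
relative = [c / total for c in counts]

# Hypothetical sparse coefficient vector that sums to zero.
beta = [1.5, -1.5, 0.0, 0.0]

pred_counts = log_contrast_predict(counts, beta)
pred_relative = log_contrast_predict(relative, beta)
```

Both predictions reduce to 1.5 * log(120 / 30), so the sequencing depth of the sample drops out. Without the zero-sum constraint the two values would differ by the coefficient sum times log(total).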

What would settle it

Apply the plugins to simulated compositional count tables whose true regression coefficients or direct-interaction graph are known; the recovered coefficients or edges should match the ground truth at rates substantially above chance and without systematic sign flips or spurious dense subgraphs.
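
The scoring step of such a simulation study could look like the sketch below, which compares an estimated coefficient vector with a known ground truth and reports support precision, support recall, and the number of sign flips. The function name `recovery_metrics`, the tolerance, and the example vectors are hypothetical; the same scoring applies to edges if each edge weight is treated as one coefficient.

```python
def recovery_metrics(true_coefs, est_coefs, tol=1e-8):
    """Score support recovery of an estimate against known ground truth.

    Returns precision and recall of the nonzero set, plus the number of
    correctly selected entries whose estimated sign is opposite to the truth.
    """
    if len(true_coefs) != len(est_coefs):
        raise ValueError("coefficient vectors must have the same length")
    true_support = {j for j, b in enumerate(true_coefs) if abs(b) > tol}
    est_support = {j for j, b in enumerate(est_coefs) if abs(b) > tol}
    hits = true_support & est_support
    # Convention chosen here: an empty selection has precision 1.0,
    # and an empty ground truth has recall 1.0.
    precision = len(hits) / len(est_support) if est_support else 1.0
    recall = len(hits) / len(true_support) if true_support else 1.0
    sign_flips = sum(1 for j in hits if true_coefs[j] * est_coefs[j] < 0)
    return {"precision": precision, "recall": recall, "sign_flips": sign_flips}


# Made-up ground truth and a made-up shrunken estimate with one false positive.
beta_true = [2.0, -2.0, 0.0, 0.0, 0.0]
beta_hat = [1.4, -1.1, 0.0, -0.3, 0.0]
metrics = recovery_metrics(beta_true, beta_hat)
```

In this toy case both true taxa are found with the right sign and one spurious taxon is selected, so recall is 1.0 and precision is 2/3. Comparing these rates with those of a random selection of the same size gives the "above chance" baseline.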

Figures

Figures reproduced from arXiv: 2604.15520 by Christian L. Mueller, Evan Bolyen, Fabian Schaipp, J. Gregory Caporaso, Leo Simpson, Oleg Vlasovets.

Figure 1: QIIME 2 workflow and its high-dimensional statistics extension. The left panel illustrates typical amplicon processing steps in QIIME 2. Feature Table, Taxonomy objects and available Metadata serve as input to the high-dimensional statistics plugins q2-classo and q2-gglasso (right).

Abstract

Motivation: Statistical analysis of microbial count data derived from 16S rRNA or metagenomics sequencing poses unique challenges due to the sparse, compositional, and high-dimensional nature of the data. While QIIME 2 already provides many tools for data pre-processing and analysis, plugins for statistical regression, classification, and microbial network estimation tailored to compositional count data are relatively scarce.

Results: We present q2-classo and q2-gglasso, two novel QIIME 2 plugins that implement penalized regression, classification, and graphical modeling approaches for microbial compositional data. q2-classo enables the prediction of a continuous or binary outcome of interest using compositional microbiome data as predictors. Both sparse log-contrast regression and classification, as well as tree-aggregated log-contrast models are available. q2-gglasso enables the estimation of taxon-taxon association networks through sparse graphical model estimation, such as, e.g., the SPIEC-EASI framework, as well as adaptive and latent graphical models. The latent model can decompose taxon-taxon associations into a sparse direct interaction matrix and a latent (low-rank) matrix which enables robust principal component embedding of a data set. Within the QIIME 2 ecosystem we demonstrate their application on the Atacama soil microbiome dataset, illustrating robust model selection, classification, and microbial network estimation with covariates and latent factors.

Availability: The software is freely available under the BSD-3-Clause License. Source code is available at https://github.com/bio-datascience/q2-gglasso and https://github.com/bio-datascience/q2-classo-latest, with installation through QIIME 2 and Docker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces two QIIME 2 plugins, q2-classo and q2-gglasso, that implement sparse log-contrast regression and classification (including tree-aggregated variants) as well as sparse graphical modeling approaches such as SPIEC-EASI, adaptive, and latent graphical models for compositional microbial count data. These are demonstrated via an application to the Atacama soil microbiome dataset illustrating model selection, classification, and network estimation with covariates and latent factors.

Significance. If the implementations are shown to be faithful, the plugins would address a gap in the QIIME 2 ecosystem by making established penalized regression and graphical modeling tools for compositional data more accessible and integrated with existing microbiome workflows, supporting reproducible analyses of associations and predictions in high-dimensional sparse count data.

major comments (2)
  1. [Results/Demonstration] The demonstration on the Atacama dataset (described in the abstract and results) provides only qualitative illustration of usage and does not include quantitative validation metrics, coefficient comparisons, selected taxa/edge sets, or performance benchmarks against the reference classo, gglasso, or SPIEC-EASI implementations on identical input tables and hyperparameters.
  2. No verification is presented that the plugins correctly replicate key compositional modeling steps such as CLR transformation, zero-handling, tree aggregation, or penalty scaling, which are load-bearing for the claim that the tools respect the sparse, compositional, and high-dimensional properties of 16S count data without introducing systematic biases.
minor comments (1)
  1. [Abstract] The abstract and availability section could specify the exact QIIME 2 version compatibility, required dependencies, and provide a minimal reproducible example workflow or command-line invocation.
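
Major comment 2 names the CLR transformation and zero-handling as load-bearing steps. For readers unfamiliar with them, the sketch below shows one common variant: add a pseudocount to every count, take logs, and subtract the per-sample mean log. The function `clr`, the pseudocount of 1.0, and the sample values are assumptions for illustration; this page does not say which zero-handling rule the plugins apply.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample after adding a pseudocount.

    Each output value is log(count + pseudocount) minus the mean of those
    logs, so the transformed sample always sums to zero.
    """
    shifted = [c + pseudocount for c in counts]
    if any(v <= 0 for v in shifted):
        raise ValueError("all shifted counts must be strictly positive")
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Made-up sample with one zero count; the pseudocount turns it into 1.
sample = [0, 9, 99, 999]
z = clr(sample)
```

The choice of pseudocount changes the transformed values of rare taxa, and it also makes the result depend mildly on sequencing depth, which is one reason a side-by-side check against a reference implementation is informative.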

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive evaluation of the significance of our work. We address each of the major comments below and outline the revisions we will make to the manuscript.

  1. Referee: The demonstration on the Atacama dataset (described in the abstract and results) provides only qualitative illustration of usage and does not include quantitative validation metrics, coefficient comparisons, selected taxa/edge sets, or performance benchmarks against the reference classo, gglasso, or SPIEC-EASI implementations on identical input tables and hyperparameters.

    Authors: We appreciate the referee's observation. The Atacama dataset application is presented primarily as an illustrative example of plugin usage and integration within QIIME 2 workflows for model selection, classification, and network estimation. We agree that quantitative elements would strengthen the manuscript. In the revised version, we will add direct comparisons of coefficient values, selected taxa/edge sets, and any relevant performance metrics by executing the original classo, gglasso, and SPIEC-EASI implementations on the identical Atacama input tables and hyperparameters, with results incorporated into the Results section.
    revision: yes

  2. Referee: No verification is presented that the plugins correctly replicate key compositional modeling steps such as CLR transformation, zero-handling, tree aggregation, or penalty scaling, which are load-bearing for the claim that the tools respect the sparse, compositional, and high-dimensional properties of 16S count data without introducing systematic biases.

    Authors: We acknowledge that the current manuscript does not include explicit verification of these steps. In the revision, we will add a dedicated verification analysis (as a new Results subsection or supplementary material) that compares outputs from q2-classo and q2-gglasso against the reference implementations for CLR transformation, zero-handling, tree aggregation, and penalty scaling. This will use the Atacama dataset or simulated compositional count data to confirm fidelity and absence of systematic biases.
    revision: yes
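
A fidelity comparison of the kind promised in both responses could be scored as in the sketch below: the largest absolute coefficient difference between a plugin run and a reference run, and the Jaccard overlap of the selected edge sets. The helper names `max_abs_diff`, `edge_set`, and `jaccard` and all numbers are hypothetical; obtaining the two outputs from the actual tools is outside this sketch.

```python
def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def edge_set(matrix, tol=1e-8):
    """Upper-triangle index pairs whose entries exceed tol in absolute value."""
    p = len(matrix)
    return {
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(matrix[i][j]) > tol
    }


def jaccard(set_a, set_b):
    """Jaccard index of two sets; two empty sets count as identical (1.0)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Made-up coefficient vectors standing in for plugin and reference output.
plugin_coefs = [0.8, -0.8, 0.0]
reference_coefs = [0.8, -0.8000001, 0.0]
coef_gap = max_abs_diff(plugin_coefs, reference_coefs)

# Made-up precision matrices: the plugin result has one extra edge (1, 2).
plugin_theta = [[1.0, 0.3, 0.0], [0.3, 1.0, -0.2], [0.0, -0.2, 1.0]]
reference_theta = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]]
edge_overlap = jaccard(edge_set(plugin_theta), edge_set(reference_theta))
```

A coefficient gap near solver tolerance together with a Jaccard index of 1.0 would support the claim of a faithful wrapper; the toy matrices above give an overlap of 0.5, the kind of discrepancy such a check is meant to surface.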

Circularity Check

0 steps flagged

No circularity: software wrapper paper with no new derivations

The paper presents two QIIME 2 plugins that implement and expose pre-existing penalized regression, classification, and graphical modeling methods (sparse log-contrast models, SPIEC-EASI-style networks, latent graphical models) for compositional microbiome data. The abstract and description focus on availability, installation, and an illustrative application to the Atacama dataset; no original equations, uniqueness theorems, or predictions are derived. All core statistical components are imported from external frameworks (classo, gglasso, etc.), so no load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain. The derivation chain is empty by design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software implementation paper that applies established statistical methods without introducing new free parameters, axioms, or invented entities beyond the existing frameworks it packages.

pith-pipeline@v0.9.0 · 5633 in / 1114 out tokens · 52833 ms · 2026-05-10T09:03:33.857046+00:00 · methodology
