pith. machine review for the scientific record.

arxiv: 2603.25561 · v2 · submitted 2026-03-26 · 💻 cs.LG

Recognition: unknown

An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Single-Cell Protein Production in Saccharomyces cerevisiae


Pith reviewed 2026-05-15 00:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords single-cell protein · Saccharomyces cerevisiae · genome-scale metabolic model · flux balance analysis · machine learning · Bayesian optimization · biomass flux · Yeast9

The pith

Bayesian optimization on Yeast9 metabolic simulations raises biomass flux more than twelve-fold for single-cell protein production.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a framework that runs flux balance analysis on the Yeast9 genome-scale model of Saccharomyces cerevisiae to generate 2,000 Latin Hypercube-sampled flux profiles, then trains machine learning models to predict biomass output. These surrogates identify key reactions and metabolic clusters, while Bayesian optimization searches for nutrient uptake rates that maximize growth. The resulting conditions increase biomass flux from 0.0858 to 1.041 gDW/hr at the selected glucose, oxygen, and ammonium uptake rates.
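The sampling step can be illustrated with a minimal Latin Hypercube sketch. This is not the authors' code: the function name `latin_hypercube` and the uptake ranges in `BOUNDS` are hypothetical, since the sampled ranges are not stated here. Only the sample count of 2,000 comes from the paper.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that, in every dimension, the range is split
    into n_samples equal strata and each stratum is hit exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum, then shuffle the stratum order.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    # Transpose the per-dimension columns into a list of sample points.
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

# Hypothetical uptake ranges in mmol/gDW/hr (negative = uptake); not from the paper.
BOUNDS = [(-20.0, -1.0), (-20.0, -1.0), (-10.0, -0.5)]  # glucose, oxygen, ammonium
samples = latin_hypercube(2000, BOUNDS)
```

Each tuple in `samples` would then be applied as exchange bounds for one FBA run, giving one row of training data for the surrogate models.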

Core claim

Integration of the Yeast9 GEM (4,131 reactions) with random forest and XGBoost regressors yields R² scores above 0.999 on simulated data; a variational autoencoder partitions fluxes into four clusters with distinct biomass means; Bayesian optimization then locates an uptake vector (glucose -20.0, oxygen -20.0, ammonium -8.9 mmol/gDW/hr) that produces a 12.13-fold biomass increase to 1.041 gDW/hr.
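The headline fold change follows directly from the two reported biomass fluxes. A quick arithmetic check, using only the numbers quoted above:

```python
baseline = 0.0858   # gDW/hr, baseline biomass flux reported in the paper
optimized = 1.041   # gDW/hr, biomass flux at the reported optimum

fold_change = optimized / baseline
print(f"{fold_change:.2f}-fold")  # prints "12.13-fold"
```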

What carries the argument

Bayesian optimization performed on machine-learning surrogates trained from flux-balance-analysis simulations of the Yeast9 genome-scale model
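To make the mechanism concrete, here is a one-dimensional sketch of Bayesian optimization: a Gaussian-process model with an upper-confidence-bound acquisition rule, written with the standard library only. It is an illustration under stated assumptions, not the paper's implementation. `toy_surrogate` is a hypothetical stand-in for a trained regressor that maps a normalized uptake rate to predicted biomass, and the kernel length scale, `kappa`, and the search grid are arbitrary choices.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)       # K^-1 y
    beta = solve(K, k_star)    # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * b for k, b in zip(k_star, beta)), 0.0)
    return mean, var

def bayes_opt(objective, n_iter=10, kappa=2.0):
    """Maximize objective on [0, 1] with a GP model and a UCB acquisition."""
    xs = [0.0, 0.5, 1.0]                      # initial design
    ys = [objective(x) for x in xs]
    grid = [i / 100 for i in range(101)]      # candidate points
    for _ in range(n_iter):
        def ucb(x):
            m, v = gp_posterior(xs, ys, x)
            return m + kappa * math.sqrt(v)
        x_next = max(grid, key=ucb)           # most promising candidate
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]

def toy_surrogate(x):
    """Hypothetical stand-in for a trained biomass regressor."""
    return -(x - 0.7) ** 2

best_x, best_y = bayes_opt(toy_surrogate)
```

In the paper's setting the objective would be the surrogate's biomass prediction over three uptake rates rather than a single scalar. The structure of the loop (fit a probabilistic model, pick the next point by an acquisition rule, evaluate, repeat) is the same.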

If this is right

  • Surrogate models with R² above 0.999 can replace repeated full FBA runs during optimization loops.
  • Twenty reactions concentrated in glycolysis, TCA cycle, and amino-acid biosynthesis control most of the biomass variance.
  • A Pareto front between biomass flux and amino-acid biosynthesis score identifies a practical operating point at 0.0858 gDW/hr biomass.
  • The variational autoencoder reveals four distinct metabolic regimes whose mean biomass fluxes differ by up to 11 percent.
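The Pareto-front idea in the list above can be sketched as a non-dominated filter over two objectives that are both maximized. The candidate values below are made up for illustration and are not results from the paper; `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.
    Each point is (biomass_flux, amino_acid_score); both are maximized."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Made-up (biomass flux, amino-acid score) pairs, for illustration only.
candidates = [(0.09, 1000.0), (0.50, 400.0), (0.40, 300.0), (1.04, 50.0), (0.30, 900.0)]
front = pareto_front(candidates)
print(front)  # (0.40, 300.0) is dropped because (0.50, 400.0) beats it on both axes
```

Choosing a single operating point from the front requires an additional preference between the two objectives; the filter only removes candidates that are worse on both.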

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same surrogate-plus-optimization pipeline could be reused for other microbial hosts once a comparable genome-scale model exists.
  • The reported GAN failure to generate stoichiometrically feasible profiles indicates that future work must embed mass-balance constraints directly inside the generator.
  • The twenty high-impact reactions supply concrete targets for genetic interventions that could further raise the experimental ceiling beyond the current uptake optimum.
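The stoichiometric feasibility test that the GAN outputs failed can be pictured as a check of steady state and flux bounds on each generated vector. The sketch below uses a toy three-reaction chain, not Yeast9; the tolerance and the helper name `is_feasible` are assumptions.

```python
def is_feasible(S, v, lower, upper, tol=1e-6):
    """Check steady state (S·v = 0) and flux bounds for one flux vector."""
    balanced = all(
        abs(sum(s_ij * v_j for s_ij, v_j in zip(row, v))) <= tol for row in S
    )
    bounded = all(lo - tol <= v_j <= hi + tol for v_j, lo, hi in zip(v, lower, upper))
    return balanced and bounded

# Toy network: uptake -> A, A -> B, B -> biomass (rows = metabolites A and B).
S = [
    [1, -1, 0],   # metabolite A: produced by reaction 0, consumed by reaction 1
    [0, 1, -1],   # metabolite B: produced by reaction 1, consumed by reaction 2
]
lower, upper = [0, 0, 0], [10, 10, 10]

print(is_feasible(S, [5, 5, 5], lower, upper))   # True: balanced and in bounds
print(is_feasible(S, [5, 4, 5], lower, upper))   # False: A accumulates, B is depleted
```

An unconstrained generator has no reason to land exactly on the null space of S, which is one plausible reading of why none of the 100 generated profiles passed.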

Load-bearing premise

That machine-learning models fitted only to simulated fluxes will accurately predict real cellular behavior and that the identified uptake rates can be supplied to living cells without violating other unmodeled constraints.

What would settle it

Measure actual biomass accumulation rate in S. cerevisiae chemostat cultures supplied with the exact predicted uptake rates of glucose -20.0, oxygen -20.0, and ammonium -8.9 mmol/gDW/hr and test whether the rate reaches or exceeds 1.041 gDW/hr.

Figures

Figures reproduced from arXiv: 2603.25561 by Aaron D'Souza, Neha K. Nair.

Figure 2: Cluster number selection diagnostics. Left: Elbow method (inertia vs. k), decreasing from ≈ 4350 at k=2 to ≈ 1500 at k=9. Right: Silhouette score vs. k, with peak at k=2 (≈ 0.341) and secondary peak at k=6 (≈ 0.326), supporting the selection of k=4.
Figure 3: K-means clustering of flux profiles in latent space (…
Figure 1: Two-dimensional latent space learned by the VAE. Each point…
Figure 4: Scatter plot of Random Forest-predicted versus true biomass flux…
Figure 5: Scatter plot of FFNN-predicted versus true biomass flux values on…
Figure 7: Random Forest feature importance scores across all 4,131 reactions. A…
Figure 8: Cluster-specific metabolic activity heatmap showing mean flux values…
Figure 9: Effect of metabolic interventions on biomass flux. Baseline (…
Figure 11: Enriched pathways among top upregulated reactions in Cluster 1.
Figure 12: The GAN-generated flux activity across the top ten metabolic…
Original abstract

Saccharomyces cerevisiae is increasingly recognised as a key source for single-cell protein (SCP) production, a rising solution to global protein-supply challenges. This study presents a computational framework combining the Yeast9 genome-scale metabolic model (GEM) with machine learning and optimisation to predict and enhance biomass flux for SCP yield. The Yeast9 GEM, comprising 4,131 reactions, 2,806 metabolites, and 1,161 genes, was simulated using flux balance analysis (FBA) across 2,000 Latin Hypercube-sampled flux profiles. Random Forest and XGBoost regressors achieved R² values of 0.9999760 and 0.9997702, respectively. A variational autoencoder (VAE) identified four metabolic clusters with mean biomass fluxes of 0.472, 0.493, 0.527, and 0.505 gDW/hr. SHAP-based feature attribution identified twenty key reactions in glycolysis, the TCA cycle, and amino-acid biosynthesis; 18/20 (90%) were confirmed essential by in silico knockout. Bayesian optimisation produced a 12.13-fold improvement in biomass flux (0.0858 to 1.041 gDW/hr) at glucose = -20.0, oxygen = -20.0, and ammonium = -8.9 mmol/gDW/hr. A generative adversarial network (GAN) generated novel flux configurations (variance = 0.124); stoichiometric feasibility verification returned 0/100 feasible profiles due to incomplete generator convergence, reported as a limitation. Pareto front analysis identified an optimal SCP operating point at 0.0858 gDW/hr biomass flux with amino-acid biosynthesis score of 1000.029 mmol/gDW/hr.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents an integrative framework combining the Yeast9 genome-scale metabolic model with flux balance analysis (FBA) on 2000 Latin Hypercube samples, machine learning regressors (Random Forest and XGBoost achieving R² > 0.999 on held-out data), variational autoencoders for clustering, SHAP for feature importance, Bayesian optimization, and a generative adversarial network to predict and optimize biomass flux for single-cell protein production in Saccharomyces cerevisiae, claiming a 12.13-fold improvement to 1.041 gDW/hr at specific uptake rates.

Significance. If the central optimization result holds under direct verification, the work demonstrates a practical surrogate-modeling pipeline for in silico metabolic engineering that could accelerate SCP strain design by identifying high-biomass flux configurations within GEM constraints. Credit is due for the high predictive accuracy of the RF/XGBoost models on simulated data and the 90% concordance between SHAP-identified reactions and in silico knockouts, which provides internal consistency checks.
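For readers weighing the reported predictive accuracy, the coefficient of determination is computed as one minus the ratio of residual to total sum of squares. A minimal sketch with made-up numbers (the function mirrors the usual definition and is not taken from the paper's code):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Undefined (division by zero) if all y_true values are identical."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up biomass fluxes, for illustration only.
y_true = [1.0, 2.0, 3.0]
print(r2_score(y_true, [1.0, 2.0, 3.0]))  # 1.0: perfect prediction
print(r2_score(y_true, [2.0, 2.0, 2.0]))  # 0.0: no better than predicting the mean
```

Because the targets here are themselves deterministic FBA outputs, a value near 1 measures how well the regressor reproduces the simulator, not how well it matches experiments.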

major comments (3)
  1. [Bayesian optimisation results] The reported 12.13-fold biomass flux improvement to 1.041 gDW/hr at glucose = -20.0, oxygen = -20.0, ammonium = -8.9 mmol/gDW/hr lacks independent confirmation via direct FBA solution of the Yeast9 model at these exact uptake rates. The training data used only 2000 Latin Hypercube samples, so without this verification the improvement may reflect surrogate extrapolation error rather than a true model optimum.
  2. [Generative adversarial network analysis] The GAN produced 0/100 feasible flux profiles due to incomplete generator convergence. This failure is load-bearing for claims about the framework's ability to generate novel configurations and requires either architectural fixes or explicit quantification of training instability to support the overall integrative approach.
  3. [Machine learning surrogate models] The RF and XGBoost models are trained directly on FBA-generated flux profiles from the same Yeast9 GEM, so the Bayesian optimum represents an optimized point inside the original stoichiometric space rather than an extrapolation to new biology. This circularity must be explicitly framed as a scope limitation for the central claim of predictive optimization.
minor comments (3)
  1. [Abstract] The abstract reports a Pareto front optimum at the baseline biomass flux of 0.0858 gDW/hr; clarify the relationship between this point and the Bayesian-optimized value of 1.041 gDW/hr to avoid apparent inconsistency.
  2. Specify the exact sampling bounds for the 2000 Latin Hypercube points so readers can determine whether the reported optimal uptake rates lie inside or outside the training distribution.
  3. [Bayesian optimisation results] Add uncertainty quantification (e.g., posterior variance or additional FBA cross-validation) for the Bayesian optimization result, as the current reporting provides only a point estimate.
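The question raised in minor comment 2, whether the optimum sits inside the sampled region, reduces to a per-variable range check once the sampling bounds are known. In the sketch below the optimum is taken from the paper, while the `bounds` dictionary is purely hypothetical because the manuscript's actual ranges are not given here.

```python
def outside_training_range(point, bounds):
    """Return the names of variables whose value falls outside the sampled range."""
    return [
        name for name, value in point.items()
        if not (bounds[name][0] <= value <= bounds[name][1])
    ]

# Reported optimum, in mmol/gDW/hr (negative = uptake).
optimum = {"glucose": -20.0, "oxygen": -20.0, "ammonium": -8.9}

# Hypothetical sampling bounds; replace with the ranges actually used.
bounds = {"glucose": (-15.0, -1.0), "oxygen": (-20.0, -1.0), "ammonium": (-10.0, -0.5)}

flagged = outside_training_range(optimum, bounds)
print(flagged)  # prints ['glucose'] for these hypothetical bounds
```

Any variable flagged this way would mark the surrogate's prediction at the optimum as an extrapolation, which is the case where a direct FBA solve matters most.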

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the work.

Point-by-point responses
  1. Referee: Bayesian optimisation results: The reported 12.13-fold biomass flux improvement to 1.041 gDW/hr at glucose = -20.0, oxygen = -20.0, ammonium = -8.9 mmol/gDW/hr lacks independent confirmation via direct FBA solution of the Yeast9 model at these exact uptake rates. The training data used only 2000 Latin Hypercube samples, so without this verification the improvement may reflect surrogate extrapolation error rather than a true model optimum.

    Authors: We agree that direct verification via FBA is required to confirm the reported optimum and exclude surrogate extrapolation artifacts. In the revised manuscript we will add the results of direct FBA on the Yeast9 model at the exact uptake rates (glucose = -20.0, oxygen = -20.0, ammonium = -8.9 mmol/gDW/hr), which yields the stated biomass flux of 1.041 gDW/hr and thereby substantiates the 12.13-fold improvement within the model constraints. revision: yes

  2. Referee: Generative adversarial network analysis: The GAN produced 0/100 feasible flux profiles due to incomplete generator convergence. This failure is load-bearing for claims about the framework's ability to generate novel configurations and requires either architectural fixes or explicit quantification of training instability to support the overall integrative approach.

    Authors: The manuscript already states that the GAN returned 0/100 feasible profiles owing to incomplete generator convergence and presents this explicitly as a limitation. To address the concern, we will expand the revised text with explicit quantification of training instability (generator and discriminator loss curves and convergence metrics) while retaining the reported outcome and its implications for the generative component. revision: partial

  3. Referee: Machine learning surrogate models: The RF and XGBoost models are trained directly on FBA-generated flux profiles from the same Yeast9 GEM, so the Bayesian optimum represents an optimized point inside the original stoichiometric space rather than an extrapolation to new biology. This circularity must be explicitly framed as a scope limitation for the central claim of predictive optimization.

    Authors: We accept this observation. The revised manuscript will include an explicit scope limitation statement clarifying that the surrogate models and Bayesian optimization identify high-flux points strictly inside the stoichiometric space defined by the Yeast9 GEM and the 2000 Latin Hypercube samples, rather than extrapolating beyond the model to new biology. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes a standard computational workflow: generate FBA data from Yeast9 GEM via Latin Hypercube sampling, train surrogate ML models (RF/XGBoost) on those data to approximate biomass flux, then apply Bayesian optimization on the surrogates to locate a high-flux point. This is a self-contained surrogate-optimization pipeline whose output (the reported 1.041 gDW/hr value) is produced by the optimization step rather than being definitionally identical to any input datum or fitted parameter. No equations reduce to prior results by construction, no load-bearing self-citations appear, and the GAN failure is explicitly noted as a limitation rather than hidden. The framework therefore does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard FBA assumptions plus the premise that ML models trained on simulated fluxes can guide real optimization. No new entities are postulated.

free parameters (3)
  • glucose_uptake
    Uptake bound set to -20 mmol/gDW/hr (negative values denote uptake) during Bayesian optimization; chosen as part of the search rather than derived.
  • oxygen_uptake
    Uptake bound set to -20 mmol/gDW/hr; part of the optimized input vector.
  • ammonium_uptake
    Uptake bound set to -8.9 mmol/gDW/hr; part of the optimized input vector.
axioms (2)
  • [standard math] Steady-state mass balance: S·v = 0 for all metabolites
    Invoked implicitly by every FBA simulation in the Yeast9 model.
  • [domain assumption] Biomass reaction flux equals growth rate under the chosen objective
    Standard assumption when maximizing biomass in GEMs for SCP yield.
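
Taken together, the two axioms correspond to the standard FBA linear program. The form below uses conventional notation; the symbols c, v^lb, and v^ub are assumptions of this sketch and are not taken from the paper.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{s.t.}\quad & S\,v = 0, \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}}
\end{aligned}
```

Here S is the stoichiometric matrix (metabolites by reactions, so 2,806 by 4,131 for Yeast9 according to the abstract), v is the flux vector, and c is zero everywhere except at the biomass reaction. The three uptake parameters listed above enter through the bounds on the corresponding exchange reactions.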


