
arxiv: 2605.31296 · v1 · pith:QYDKRD7Qnew · submitted 2026-05-29 · 🧬 q-bio.BM · cs.LG

mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties

Pith reviewed 2026-06-28 19:52 UTC · model grok-4.3

classification 🧬 q-bio.BM cs.LG
keywords mRNA design · discrete diffusion · multi-objective optimization · therapeutic mRNA · codon optimization · UTR design · protein expression · sequence generation

The pith

mRNAutilus generates complete mRNA transcripts in a single diffusion process; its zero-shot P. pyralis luciferase designs achieve over 400-fold higher protein expression than the wild-type sequence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces mRNAutilus, a framework that trains a masked discrete diffusion model on millions of full-length mRNAs and applies Monte Carlo Tree Guidance to generate entire transcripts, coding sequence and UTRs together. Lightweight regressors operating on the model's embeddings predict half-life, translation efficiency, and protein abundance, so generation can optimize several properties at once toward Pareto-efficient sequences. This unified approach replaces separate codon optimization and UTR design followed by assembly. The resulting zero-shot luciferase sequences show more than 400-fold expression gains over wild-type and beat commercial and other machine-learning baselines; zero-shot SARS-CoV-2 Spike designs likewise exceed clinically used and commercial constructs, and the framework is also applied to prime editing and uAb constructs. A sympathetic reader would see value in a method that directly produces functional mRNAs tailored to specific biological uses without extensive post-design screening.
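As a rough illustration of masked generation over a coding region, the sketch below starts from a fully masked codon sequence and unmasks positions over a few steps. It is a toy under stated assumptions: the codon table is a small hypothetical subset of the standard genetic code, the names `SYNONYMS` and `toy_unmask` are invented here, and uniform random sampling among synonymous codons stands in for the trained denoiser, whose architecture and sampling schedule this summary does not describe.

```python
import random

# Hypothetical mini codon table (a subset of the standard genetic code).
SYNONYMS = {
    "M": ["AUG"],
    "K": ["AAA", "AAG"],
    "L": ["CUG", "CUC", "UUG"],
    "*": ["UAA", "UGA"],
}
MASK = "___"

def toy_unmask(protein, steps=3, seed=0):
    """Iteratively unmask codons; uniform sampling stands in for a trained model."""
    rng = random.Random(seed)
    codons = [MASK] * len(protein)
    for step in range(steps):
        masked = [i for i, c in enumerate(codons) if c == MASK]
        if not masked:
            break
        # Unmask an even share of what is left, so the last step clears all masks.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        for i in rng.sample(masked, k):
            # Restricting choices to synonymous codons keeps the protein fixed.
            codons[i] = rng.choice(SYNONYMS[protein[i]])
    return codons

cds = toy_unmask("MKL*")
print("".join(cds))
```

Restricting each position to synonymous codons is what makes this a codon-optimization setting: the encoded protein never changes, only the nucleotide sequence that encodes it.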

Core claim

mRNAutilus combines a masked discrete diffusion model trained on full-length mRNAs with Monte Carlo Tree Guidance and embedding-based regressors to generate complete transcripts optimized simultaneously for stability, translation efficiency, and protein abundance, yielding zero-shot constructs that exceed wild-type expression by over 400-fold for P. pyralis luciferase and outperform baselines for SARS-CoV-2 Spike, prime editing, and proteome modulation applications.

What carries the argument

Masked discrete diffusion model with Monte Carlo Tree Guidance that uses lightweight regressors on embeddings to score and steer generation toward multi-objective optima for complete mRNA sequences.
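To make "multi-objective optima" concrete, here is a minimal sketch of Pareto filtering over predicted property scores. The candidate names and score values are hypothetical placeholders, and this is a generic non-dominated filter, not the paper's Monte Carlo Tree Guidance procedure.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return candidates not dominated by any other (all objectives maximized)."""
    return [
        (name, s) for name, s in candidates
        if not any(dominates(t, s) for _, t in candidates)
    ]

# Hypothetical predicted scores: (half-life, translation efficiency, protein abundance)
candidates = [
    ("seq_a", (0.9, 0.2, 0.5)),
    ("seq_b", (0.4, 0.8, 0.6)),
    ("seq_c", (0.3, 0.7, 0.5)),  # dominated by seq_b
    ("seq_d", (0.9, 0.2, 0.4)),  # dominated by seq_a
]
front = pareto_front(candidates)
print([name for name, _ in front])
```

Neither `seq_a` nor `seq_b` beats the other on every objective, so both stay on the front; a guidance procedure would then have to trade them off or keep the whole set.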

If this is right

  • Complete mRNA transcripts can be produced without separate design of coding sequences and UTRs followed by post-hoc assembly.
  • Multiple functional objectives can be balanced during generation rather than optimized independently.
  • The same sequence-based framework extends to mRNAs for prime editing constructs and programmable proteome modulators.
  • Zero-shot performance can exceed both wild-type and existing commercial or lab-optimized designs across diverse targets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the embedding regressors generalize, the method could shorten design cycles by reducing reliance on large-scale experimental screening of candidate sequences.
  • The unified diffusion-plus-guidance structure might transfer to design of other nucleic-acid therapeutics such as siRNA or circular RNA.
  • Performance on protein abundance and durability objectives suggests the approach could support personalized mRNA constructs where rapid iteration is needed.

Load-bearing premise

Lightweight regressors trained on model embeddings can reliably predict half-life, translation efficiency, and protein abundance for novel generated sequences outside the training distribution.
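One way to probe this premise is to check whether regressor predictions rank sequences in the same order as measurements do. The sketch below computes a Spearman rank correlation in plain Python; the `predicted` and `measured` values are hypothetical placeholders, and nothing here reflects how the paper evaluates its regressors.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation of the rank vectors (undefined if either input is constant)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

predicted = [0.1, 0.4, 0.35, 0.8]  # hypothetical regressor outputs
measured = [1.0, 3.0, 2.0, 9.0]    # hypothetical assay readouts
rho = spearman(predicted, measured)
print(rho)
```

Rank agreement matters more than absolute calibration when scores are only used to steer a search, which is why a rank statistic is a natural first check.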

What would settle it

Synthesizing the zero-shot generated sequences and assaying their actual protein expression levels in cells, then finding they fall short of the reported 400-fold gains or fail to beat the listed commercial and machine-learning baselines.
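The comparison described above reduces to a fold-change calculation over replicate readings. The sketch below shows that arithmetic with placeholder luminescence values that are purely illustrative and not taken from the paper; the threshold of 400 mirrors the reported claim.

```python
from statistics import mean, stdev

# Placeholder readings (arbitrary units), three replicates each. NOT real data.
designed = [4200, 4100, 4300]
wild_type = [10, 9, 11]

# Fold change is the ratio of replicate means.
fold = mean(designed) / mean(wild_type)
meets_claim = fold > 400

print(f"fold change: {fold:.1f}, designed SD: {stdev(designed):.1f}")
print("exceeds 400-fold:", meets_claim)
```

A real analysis would also need a significance test and consistent normalization across conditions; this sketch covers only the ratio itself.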

Figures

Figures reproduced from arXiv: 2605.31296 by Cedric Wu, Divya Srijay, Fengmei Pi, Ping-Jung Lin, Pranam Chatterjee, Sawan Patel, Shambhavi Shubham, Sherwood Yao, Sophia Tang, Yesol Kim, Yinuo Zhang.

Figure 1: Overview of mRNAutilus. (A) Masked diffusion model pretraining and de novo mRNA sequence generation. (B) mRNA function and property prediction using XGBoost regression analysis from model embeddings. (C) Multi-objective codon optimization and UTR generation using Monte Carlo Tree Guidance, which leverages tree search to optimize a set of Pareto-optimal sequences. (D) In vitro experiments demonstrate that z…
Figure 2: Correlation plots for property regressors. (A) Half-life, (B) translation efficiency, and (C) protein abundance correlation for validation (left) and training (right) data. Optimization curves for (D) half-life, (E) translation efficiency, and (F) protein abundance for conditional generation of P. pyralis luciferase mRNA via Monte Carlo Tree Guidance (MCTG). Shown are Pareto front scores with standard erro…
Figure 3: Metric-driven evaluation of in-silico full-sequence mRNA generation methods. A library of 250 luciferase mRNAs was produced using mRNAutilus, randomness, and GEMORNA. All three methods were prompted to design a luciferase mRNA encoding the GenScript F-Luc protein. (A) Coding sequence comparison using all three methods. Considered metrics are codon adaptation index (CAI), length-normalized minimum-free ener…
Figure 4: mRNAs designed by mRNAutilus achieve superior zero-shot performance over competing methods in vitro. (A) Schematic of the experimental validation workflow for mRNAutilus-designed sequences. (B) Absolute expression levels for P. pyralis luciferase mRNAs. (C) Normalized expression levels of P. pyralis luciferase mRNAs over time in HEK293T cells (n=3 biological replicates). Error bars denote the standard devi…
Figure 5: Design of therapeutic mRNAs with mRNAutilus and in vitro evaluation. (A) Architecture and mechanism of the uAb degradation system with sequences optimized by mRNAutilus. CHIP∆TPR fused to the C-terminus of β-catenin-specific peptides targets endogenous proteins for ubiquitin-mediated degradation via the proteasome following mRNA transfection. Created with BioRender. (B) Quantitative PCR analysis of uAb mRN…
Original abstract

Therapeutic mRNA design requires coordinating multiple interacting sequence features across the full transcript, where codon usage, untranslated regions (UTRs), and their coupling jointly determine stability, translation efficiency, and protein expression. Here, we present mRNA generation via unrolled trajectories and informed latent updates (mRNAutilus), a framework for simultaneous codon optimization and de novo UTR design directly from sequence. mRNAutilus combines a masked discrete diffusion model trained on millions of full-length mRNAs with Monte Carlo Tree Guidance to generate Pareto-efficient sequences under multiple functional objectives, using lightweight regressors over model embeddings to predict half-life, translation efficiency, and protein abundance. Unlike recent methods that design coding sequences and UTRs separately or rely on post hoc assembly and screening, mRNAutilus generates complete transcripts in a single process optimized across properties. Across diverse targets, zero-shot mRNAs encoding P. pyralis luciferase achieve over 400-fold higher expression than wild-type and outperform commercial and machine learning-designed baselines, including zero-shot generative approaches. Zero-shot SARS-CoV-2 Spike mRNAs exceed clinically used and commercial constructs and match or surpass lab-optimized designs with improved durability. We further demonstrate generality in therapeutic settings, including prime editing (PEMax) and programmable proteome modulation, where mRNAutilus-designed constructs enhance expression of peptide-guided E3 ligases (uAbs) for beta-catenin degradation. These results establish a sequence-based, multi-objective framework for generating functional mRNAs tailored to diverse biological applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces mRNAutilus, a masked discrete diffusion model trained on millions of full-length mRNAs, combined with Monte Carlo Tree Guidance that uses lightweight regressors on model embeddings to design codon usage and UTRs jointly while optimizing half-life, translation efficiency, and protein abundance. It claims zero-shot generation of P. pyralis luciferase mRNAs achieving >400-fold higher expression than wild-type, outperforming commercial, ML-designed, and other zero-shot baselines, with further gains for SARS-CoV-2 Spike and applications to prime editing and uAb constructs for beta-catenin degradation.

Significance. If the zero-shot experimental gains are robustly attributable to the multi-objective guidance rather than post-hoc selection or regressor artifacts, the unified sequence-based framework would represent a meaningful advance over separate CDS/UTR design pipelines, with potential for broader therapeutic mRNA applications.

major comments (3)
  1. [Results] Results (luciferase and Spike experiments): the central 400-fold expression claim and outperformance of baselines rest on zero-shot measurements whose exact replicate counts, statistical tests, data filtering criteria, and baseline sequence constructions are not verifiable from the provided text; without these, attribution to the guidance procedure cannot be confirmed.
  2. [Methods] Methods (regressor training and guidance): the lightweight regressors for half-life, translation efficiency, and abundance are trained on model embeddings, yet no section demonstrates their calibration or ranking accuracy on sequences whose embedding distance or property values lie outside the original training support; if miscalibrated on OOD points generated by the diffusion process, the Monte Carlo Tree Guidance signal is unreliable and the reported gains cannot be attributed to optimization.
  3. [Methods] Methods (diffusion model and guidance): the independence between the regressor training data and the sequences used to train or sample from the diffusion model is not explicitly stated; overlap would introduce circularity that undermines the claim of external validation for the multi-objective scores.
minor comments (2)
  1. [Methods] Notation for the unrolled trajectories and informed latent updates is introduced without a clear equation reference or pseudocode, making the precise update rule difficult to reconstruct.
  2. [Figures] Figure legends for the Pareto-front and expression plots do not specify the number of independent biological replicates or error-bar definitions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on the clarity of experimental details and methodological independence. We address each major point below and will revise the manuscript accordingly where needed.

point-by-point responses
  1. Referee: [Results] Results (luciferase and Spike experiments): the central 400-fold expression claim and outperformance of baselines rest on zero-shot measurements whose exact replicate counts, statistical tests, data filtering criteria, and baseline sequence constructions are not verifiable from the provided text; without these, attribution to the guidance procedure cannot be confirmed.

    Authors: We agree that the manuscript text lacks sufficient detail for independent verification of the reported expression gains. In the revised version we will add a dedicated experimental methods subsection and supplementary table specifying replicate counts (n=3 biological replicates per condition for luciferase assays), statistical tests (two-tailed Student's t-test with multiple-comparison correction), outlier filtering criteria, and exact sequences for all baselines (including commercial constructs and their accession or catalog numbers). These additions will allow direct assessment of attribution to the guidance procedure. revision: yes

  2. Referee: [Methods] Methods (regressor training and guidance): the lightweight regressors for half-life, translation efficiency, and abundance are trained on model embeddings, yet no section demonstrates their calibration or ranking accuracy on sequences whose embedding distance or property values lie outside the original training support; if miscalibrated on OOD points generated by the diffusion process, the Monte Carlo Tree Guidance signal is unreliable and the reported gains cannot be attributed to optimization.

    Authors: The manuscript does not currently include explicit OOD calibration results for the regressors. We will add a supplementary figure and analysis that evaluates regressor ranking accuracy and calibration error on sequences sampled from the diffusion model whose embedding distances and predicted property values fall outside the original training support. This will directly test whether the Monte Carlo Tree Guidance signal remains reliable for the generated sequences. revision: yes

  3. Referee: [Methods] Methods (diffusion model and guidance): the independence between the regressor training data and the sequences used to train or sample from the diffusion model is not explicitly stated; overlap would introduce circularity that undermines the claim of external validation for the multi-objective scores.

    Authors: The regressor training data consists of experimentally measured sequences drawn from independent public and internal datasets that do not overlap with the diffusion model's training corpus. We will add an explicit statement and data-source table in the revised Methods section documenting this separation and the overlap checks performed, thereby removing any ambiguity regarding circularity. revision: yes
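The overlap checks mentioned in the last response are not specified. As one possible form they could take, the sketch below does an exact-match comparison after normalizing sequences; the function names and example sequences are hypothetical, and the simulated rebuttal does not say which procedure the authors used.

```python
def normalize(seq):
    """Uppercase, strip whitespace, and map DNA-style T to RNA U."""
    return "".join(seq.split()).upper().replace("T", "U")

def exact_overlap(regressor_seqs, pretraining_seqs):
    """Sequences present in both collections after normalization."""
    pretraining = {normalize(s) for s in pretraining_seqs}
    return sorted({normalize(s) for s in regressor_seqs} & pretraining)

# Hypothetical toy sequences standing in for the two datasets.
regressor_seqs = ["AUGGCC", "augaaa", "AUGUUU"]
pretraining_seqs = ["ATGAAA", "AUGCCC"]
shared = exact_overlap(regressor_seqs, pretraining_seqs)
print(shared)
```

Exact matching catches only identical sequences. Near-duplicates such as isoforms or single-nucleotide variants would need a similarity-based comparison, which this sketch does not attempt.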

Circularity Check

0 steps flagged

No significant circularity; experimental validation is independent of internal predictors

full rationale

The paper trains a masked discrete diffusion model on mRNA sequences and lightweight regressors on its embeddings, which steer Monte Carlo Tree Guidance toward multi-objective optima. However, the load-bearing claims (400-fold luciferase expression gains, outperformance of baselines, and results in prime editing/uAb settings) are established via direct experimental assays on the synthesized transcripts, not by reusing the regressor scores as the success metric. No equation or section reduces a reported outcome to a fitted parameter by construction, and no self-citation chain is invoked to justify uniqueness or the guidance procedure. The central results therefore rest on external experimental measurements rather than on the model's own predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework depends on the accuracy of embedding-based regressors and the generalization of the diffusion model; these are treated as domain assumptions without independent external benchmarks supplied in the abstract.

free parameters (1)
  • Diffusion model and guidance hyperparameters
    Typical ML training choices that control generation behavior and are not derived from first principles.
axioms (1)
  • domain assumption: Embedding-based regressors accurately predict functional properties for out-of-distribution sequences
    This assumption enables the Monte Carlo Tree Guidance step and is invoked to justify Pareto-efficient generation.

pith-pipeline@v0.9.1-grok · 5844 in / 1308 out tokens · 23777 ms · 2026-06-28T19:52:10.599193+00:00 · methodology

