pith. machine review for the scientific record.

arxiv: 2604.05181 · v1 · submitted 2026-04-06 · 💻 cs.LG

Recognition: no theorem link

General Multimodal Protein Design Enables DNA-Encoding of Chemistry

Pith reviewed 2026-05-10 19:27 UTC · model grok-4.3

classification 💻 cs.LG
keywords protein design · enzyme engineering · diffusion models · generative models · carbene transfer · heme enzymes · multimodal generation · directed evolution

The pith

A multimodal diffusion model designs heme enzymes that catalyze new-to-nature carbene-transfer reactions when given only reactive intermediates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DISCO, a generative model that jointly creates protein sequences and their 3D structures conditioned on any chosen biomolecule. Conditioned only on reactive intermediates, the model produces diverse heme enzymes that feature novel active-site shapes. These enzymes perform carbene-transfer reactions such as alkene cyclopropanation, spirocyclopropanation, B-H insertions, and C(sp3)-H insertions, with measured activities that exceed those of enzymes produced by conventional engineering. Random mutagenesis of one design shows that the enzymes remain evolvable through standard directed evolution. The work supplies a general method for encoding previously inaccessible chemical transformations directly in DNA.

Core claim

DISCO (DIffusion for Sequence-structure CO-design) is a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, together with inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp3)-H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution.

What carries the argument

DISCO, the multimodal diffusion model that jointly generates protein sequences and 3D structures around user-specified molecules, using inference-time scaling to optimize across both modalities.
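For the sequence modality, the supplementary fragments captured in the reference graph describe a masked forward process: at time r, each token independently stays in its original state with probability α_r and is otherwise replaced by a mask token m, with α_0 = 1 and α_1 = 0. The following is a minimal sketch of that corruption step only, not the authors' code. The linear schedule, the `MASK` symbol, and the function names are assumptions made for illustration.

```python
import random

MASK = "#"  # hypothetical mask symbol standing in for the mask token m


def linear_alpha(r):
    """Assumed schedule satisfying alpha_0 = 1 and alpha_1 = 0."""
    return 1.0 - r


def mask_sequence(seq, r, rng, alpha=linear_alpha):
    """Keep each token with probability alpha(r); otherwise replace it with MASK."""
    keep_prob = alpha(r)
    return "".join(tok if rng.random() < keep_prob else MASK for tok in seq)


rng = random.Random(0)
clean = "MKTAYIAKQR"  # made-up sequence, for illustration only
print(mask_sequence(clean, 0.0, rng))  # unchanged: every token is kept at r = 0
print(mask_sequence(clean, 0.5, rng))  # on average, about half the positions are masked
print(mask_sequence(clean, 1.0, rng))  # fully masked at r = 1
```

The generative direction runs this in reverse, with a denoiser predicting the clean tokens at masked positions; that part depends on the trained network and is not sketched here.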

If this is right

  • Enzymes for new-to-nature reactions can be created without pre-specifying catalytic residues.
  • Designed enzymes can be further improved by random mutagenesis and directed evolution.
  • A scalable route exists for expanding the set of chemical transformations that DNA can encode.
  • General conditioning on arbitrary biomolecules enables protein design beyond binding or catalysis.
  • Inference-time scaling across sequence and structure improves optimization of multimodal objectives.
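The last point can be made concrete with the particle scheme that the extracted supplementary fragments attribute to Feynman-Kac corrector (FKC) steering: each particle is advanced by the reverse process, its log-weight is incremented by h_t(x^k) dt, and the normalized weights are used to resample the batch. The sketch below is a toy under stated assumptions, not DISCO's implementation. `step_fn` and `potential_fn` are hypothetical stand-ins for the model's reverse step and for h_t, and resetting the weights to uniform after each resampling is a common sequential Monte Carlo convention that the fragments do not spell out.

```python
import math
import random


def normalize_log_weights(log_w):
    """Turn log-weights into probabilities, subtracting the max for stability."""
    m = max(log_w)
    exps = [math.exp(lw - m) for lw in log_w]
    total = sum(exps)
    return [e / total for e in exps]


def fk_steer(particles, step_fn, potential_fn, n_steps, dt, rng):
    """Advance, reweight, and resample a batch of particles for n_steps."""
    log_w = [0.0] * len(particles)
    for i in range(n_steps):
        t = i * dt
        # 1. Update each particle's state with the (stand-in) reverse step.
        particles = [step_fn(x, t, rng) for x in particles]
        # 2. Log-weight update: log w <- log w + h_t(x) * dt.
        log_w = [lw + potential_fn(x, t) * dt for lw, x in zip(log_w, particles)]
        # 3. Normalize the weights and resample the batch with replacement.
        probs = normalize_log_weights(log_w)
        idx = rng.choices(range(len(particles)), weights=probs, k=len(particles))
        particles = [particles[j] for j in idx]
        log_w = [0.0] * len(particles)  # assumed: uniform weights after resampling
    return particles


steered = fk_steer(
    particles=[0.0, 1.0],
    step_fn=lambda x, t, rng: x,            # identity step, toy only
    potential_fn=lambda x, t: 1000.0 * x,   # strongly favors larger x
    n_steps=1,
    dt=1.0,
    rng=random.Random(0),
)
print(steered)  # [1.0, 1.0]
```

In the toy run the potential is so lopsided that the low-scoring particle receives zero weight and the batch collapses onto the high-scoring one. A realistic potential would be gentler, trading diversity against the objective.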

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could produce biocatalysts for industrial reactions that lie outside the chemistry explored by natural evolution.
  • The same co-design approach might extend to creating proteins that perform other functions such as transport or signaling.
  • High-throughput screening combined with DISCO designs could accelerate discovery of enzymes for additional reaction classes.
  • Conditioning directly on reaction intermediates offers a way to target specific chemical mechanisms in protein design.

Load-bearing premise

The generated sequence-structure pairs fold into stable, functional enzymes in vitro and show the claimed catalytic activity without any pre-specification of catalytic residues or further experimental tuning.

What would settle it

Laboratory synthesis and assay of the top DISCO designs yielding proteins that either fail to fold or exhibit no measurable activity above background in the described carbene-transfer reactions.

Original abstract

Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp$^3$)-H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations. Code is available at https://github.com/DISCO-design/DISCO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DISCO, a multimodal diffusion model for co-designing protein sequences and 3D structures conditioned on arbitrary biomolecules (specifically reactive intermediates). It claims to generate diverse heme enzymes with novel active-site geometries that catalyze new-to-nature carbene-transfer reactions (alkene cyclopropanation, spirocyclopropanation, B-H and C(sp3)-H insertions) at activities exceeding engineered enzymes, with further activity gains via random mutagenesis and directed evolution. The work emphasizes that designs require no pre-specification of catalytic residues and provides open code.

Significance. If the central claims hold, the result would be significant for protein engineering by demonstrating scalable design of evolvable enzymes for chemistries outside natural scope, without pre-specifying catalytic residues. The open code repository is a clear strength that supports reproducibility and follow-up work. However, the absence of direct experimental validation of the designed geometries limits the strength of the mechanistic interpretation.

major comments (2)
  1. [Results section (enzyme design and activity assays)] The headline claim that DISCO designs enzymes with 'novel active-site geometries' responsible for the observed carbene-transfer activities is not supported by any experimental structural data (X-ray, cryo-EM, or NMR) on the expressed and purified designs. Computational structure predictions alone cannot confirm that the proteins fold as co-designed or that the claimed geometries (rather than partial folding, alternative modes, or contaminants) produce the reported activities.
  2. [Methods section (experimental procedures)] The manuscript reports activity assays and directed evolution but provides insufficient detail on replicates, negative controls, background activity subtraction, quantitative benchmarks against specific engineered enzymes, and expression/purification protocols. This makes it difficult to evaluate the robustness of the 'high activities exceeding those of engineered enzymes' claim and the assumption that the generated sequence-structure pairs yield stable, functional enzymes without additional optimization.
minor comments (1)
  1. [Abstract and Methods] The description of inference-time scaling methods in the abstract is not accompanied by a clear algorithmic outline or pseudocode in the main text, which would aid readers in reproducing the optimization across modalities.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for their positive assessment of the significance of our work and for highlighting areas where additional clarity and detail would strengthen the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: Results section (enzyme design and activity assays): The headline claim that DISCO designs enzymes with 'novel active-site geometries' responsible for the observed carbene-transfer activities is not supported by any experimental structural data (X-ray, cryo-EM, or NMR) on the expressed and purified designs. Computational structure predictions alone cannot confirm that the proteins fold as co-designed or that the claimed geometries (rather than partial folding, alternative modes, or contaminants) produce the reported activities.

    Authors: We agree that experimental structural characterization would provide definitive evidence for the designed geometries. However, our claims are based on the co-design process, high-confidence structure predictions (e.g., via AlphaFold), and crucially, the observed catalytic activities for multiple new-to-nature reactions, which are unlikely to occur without properly formed active sites. Additionally, the success of directed evolution on one design further supports that the proteins are functional and evolvable as designed. We will revise the manuscript to more explicitly state that the geometries are computationally predicted and to include additional analyses, such as molecular dynamics simulations or comparison to known structures, to bolster the interpretation. We note that obtaining high-resolution structures for de novo designs is a common challenge in the field and often follows initial functional validation.

    revision: partial

  2. Referee: Methods section (experimental procedures): The manuscript reports activity assays and directed evolution but provides insufficient detail on replicates, negative controls, background activity subtraction, quantitative benchmarks against specific engineered enzymes, and expression/purification protocols. This makes it difficult to evaluate the robustness of the 'high activities exceeding those of engineered enzymes' claim and the assumption that the generated sequence-structure pairs yield stable, functional enzymes without additional optimization.

    Authors: We thank the referee for pointing this out. We will substantially expand the Methods section in the revised manuscript to provide full details on: (i) the number of biological and technical replicates for all assays (typically n=3), (ii) negative controls including no-enzyme controls, apo-protein controls, and non-heme controls, (iii) the exact methods for background subtraction and data normalization, (iv) direct quantitative comparisons to specific previously engineered enzymes with literature references and activity values, and (v) complete expression (e.g., E. coli strain, plasmid, induction conditions) and purification protocols (e.g., chromatography steps, buffers). These additions will allow better assessment of the robustness of our results.

    revision: yes

standing simulated objections not resolved
  • Obtaining experimental structural data (X-ray, cryo-EM, or NMR) to directly validate the designed active-site geometries, which is not available in the current study.

Circularity Check

0 steps flagged

No circularity: experimental validation is independent of model inputs

Full rationale

The paper trains a multimodal diffusion model (DISCO) on existing sequence-structure-ligand data and uses it to generate novel protein designs conditioned on reactive intermediates. These designs are then expressed, subjected to random mutagenesis, and assayed for catalytic activity in a separate experimental workflow. No equations, fitted parameters, or self-citations are shown to reduce the reported activities or novel geometries to the training inputs by construction. The central claims rest on wet-lab results rather than any definitional or statistical equivalence to the generative process itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the trained parameters of the diffusion model and the domain assumption that generated designs will produce stable, active enzymes when expressed; no new physical entities are postulated.

free parameters (1)
  • diffusion model parameters
    Large number of weights learned from protein sequence-structure data; central to generating the designs.
axioms (1)
  • domain assumption: Joint sequence-structure diffusion conditioned on a ligand can produce functional enzymes without pre-specified catalytic residues.
    Core modeling assumption invoked to justify the design pipeline.

pith-pipeline@v0.9.0 · 5562 in / 1269 out tokens · 48747 ms · 2026-05-10T19:27:16.570689+00:00 · methodology

Reference graph

Works this paper leans on

74 extracted references · 4 canonical work pages

  1. [1]

    R. Buller, et al., From nature to industry: Harnessing enzymes for biocatalysis. Science 382(6673), eadh8615 (2023)

  2. [2]

    J. C. Reisenbauer, K. M. Sicinski, F. H. Arnold, Catalyzing the future: recent advances in chemical synthesis using enzymes. Curr. Opin. Chem. Biol. 83, 102536 (2024)

  3. [3]

    K. Chen, F. H. Arnold, Engineering new catalytic activities in enzymes. Nat. Catal. 3(3), 203–213 (2020)

  4. [4]

    J. L. Watson, et al., De novo design of protein structure and function with RFdiffusion. Nature 620(7976), 1089–1100 (2023)

  5. [5]

    J. Wang, et al., Scaffolding protein functional sites using deep learning. Science 377(6604), 387–394 (2022)

  6. [6]

    M. Pacesa, et al., One-shot design of functional protein binders with BindCraft. Nature 646(8084), 483–492 (2025)

  7. [7]

    W. Ahern, et al., Atom-level enzyme active site scaffolding using RFdiffusion2. Nat. Methods 23(1), 96–105 (2026)

  8. [8]

    A. Lauko, et al., Computational design of serine hydrolases. Science 388(6744), eadu2454 (2025)

  9. [9]

    A. H.-W. Yeh, et al., De novo design of luciferases using deep learning. Nature 614(7949), 774–780 (2023)

  10. [10]

    J. Butcher, et al., De novo design of all-atom biomolecular interactions with rfdiffusion3. bioRxiv (2025)

  11. [11]

    H. Stark, et al., BoltzGen: Toward Universal Binder Design. bioRxiv (2025)

  12. [12]

    K. Rojas, Y. Zhu, S. Zhu, F. X. F. Ye, M. Tao, Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces, in International Conference on Machine Learning (2025)

  13. [13]

    X. Wang, et al., Diffusion language models are versatile protein learners, in International Conference on Machine Learning (2024)

  14. [14]

    J. Abramson, et al., Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630(8016), 493–500 (2024)

  15. [15]

    J. Dauparas, et al., Atomic context-conditioned protein sequence design using LigandMPNN. Nat. Methods 22(4), 717–723 (2025)

  16. [16]

    S. S. Sahoo, et al., Simple and Effective Masked Diffusion Language Models, in Advances in Neural Information Processing Systems 37 (2024)

  17. [17]

    F. Z. Peng, et al., Path planning for masked diffusion model sampling. arXiv preprint arXiv:2502.03540 (2025)

  18. [18]

    G. Wang, Y. Schiff, S. S. Sahoo, V. Kuleshov, Remasking Discrete Diffusion Models with Inference-Time Scaling, in Advances in Neural Information Processing Systems 38 (2025)

  19. [19]

    Z. Lin, et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), 1123–1130 (2023)

  20. [20]

    C. D. team, et al., Chai-1: Decoding the molecular interactions of life. bioRxiv pp. 2024–10 (2024)

  21. [21]

    M. Skreta, et al., Feynman-Kac correctors in diffusion: Annealing, guidance, and product of experts, in International Conference on Machine Learning (2025)

  22. [22]

    M. Hasan, et al., Discrete feynman-kac correctors. arXiv preprint arXiv:2601.10403 (2026)

  23. [23]

    D. Listov, C. A. Goverde, B. E. Correia, S. J. Fleishman, Opportunities and challenges in design and optimization of protein function. Nat. Rev. Mol. Cell Biol. 25(8), 639–653 (2024)

  24. [24]

    H. Kim, R. S. Kim, M. Mirdita, J. Yoon, M. Steinegger, Structural motif search across the protein-universe with Folddisco. bioRxiv pp. 2025–07 (2025)

  25. [25]

    Y. Yang, F. H. Arnold, Navigating the Unnatural Reaction Space: Directed Evolution of Heme Proteins for Selective Carbene and Nitrene Transfer. Acc. Chem. Res. 54(5), 1209–1225 (2021)

  26. [26]

    I. Kalvet, et al., Design of Heme Enzymes with a Tunable Substrate Binding Pocket Adjacent to an Open Metal Coordination Site. J. Am. Chem. Soc. 145(26), 14307–14315 (2023)

  27. [27]

    K. Hou, et al., De novo design of porphyrin-containing proteins as efficient and stereoselective catalysts. Science 388(6747), 665–670 (2025)

  28. [28]

    W. Huang, et al., De Novo Design, Directed Evolution and Computational Study of Heme-Binding Helical Bundle Protein Catalysts for Biocatalytic Enantioselective Ge–H Insertion. J. Am. Chem. Soc. 147(44), 40869–40878 (2025)

  29. [29]

    C. Damiano, P. Sonzini, E. Gallo, Iron catalysts with N-ligands for carbene transfer of diazo reagents. Chem. Soc. Rev. 49(14), 4867–4905 (2020)

  30. [30]

    M. Garcia-Borràs, et al., Origin and Control of Chemoselectivity in Cytochrome c Catalyzed Carbene Transfer into Si–H and N–H bonds. J. Am. Chem. Soc. 143(18), 7114–7123 (2021)

  31. [31]

    P. S. Coelho, E. M. Brustad, A. Kannan, F. H. Arnold, Olefin Cyclopropanation via Carbene Transfer Catalyzed by Engineered Cytochrome P450 Enzymes. Science 339(6117), 307–310 (2013)

  32. [32]

    S. B. J. Kan, X. Huang, Y. Gumulya, K. Chen, F. H. Arnold, Genetically programmed chiral organoborane synthesis. Nature 552(7683), 132–136 (2017)

  33. [33]

    R. K. Zhang, et al., Enzymatic assembly of carbon–carbon bonds via iron-catalysed sp$^3$ C–H functionalization. Nature 565(7737), 67–72 (2019)

  34. [34]

    J. L. Kennemur, Y. Long, C. J. Ko, A. Das, F. H. Arnold, Enzymatic stereodivergent synthesis of azaspiro[2.y]alkanes. J. Am. Chem. Soc. 147(31), 27165–27171 (2025)

  35. [35]

    J. Ho, A. Jain, P. Abbeel, Denoising Diffusion Probabilistic Models, in Advances in Neural Information Processing Systems 33 (2020)

  36. [36]

    D. T. Gillespie, A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics 22(4), 403–434 (1976)

  37. [37]

    M. Ren, T. Zhu, H. Zhang, Carbonnovo: Joint design of protein structure and sequence using a unified energy-based model, in International Conference on Machine Learning (2024)

  38. [38]

    J. Shi, K. Han, Z. Wang, A. Doucet, M. Titsias, Simplified and generalized masked diffusion for discrete data, in Advances in Neural Information Processing Systems 37 (2024)

  39. [39]

    J. Jumper, et al., Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021)

  40. [40]

    S. Nie, et al., Large Language Diffusion Models, in Advances in Neural Information Processing Systems 38 (2025)

  41. [41]

    T. Karras, et al., Guiding a diffusion model with a bad version of itself, in Advances in Neural Information Processing Systems 37 (2024)

  42. [42]

    H. Chang, H. Zhang, L. Jiang, C. Liu, W. T. Freeman, Maskgit: Masked generative image transformer, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022), pp. 11315–11325

  43. [43]

    S. Lin, B. Liu, J. Li, X. Yang, Common diffusion noise schedules and sample steps are flawed, in Proceedings of the IEEE/CVF winter conference on applications of computer vision (2024), pp. 5404–5411

  44. [44]

    J. Dauparas, et al., Robust deep learning–based protein sequence design using ProteinMPNN. Science 378(6615), 49–56 (2022)

  45. [45]

    M. Steinegger, J. Söding, MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35(11), 1026–1028 (2017)

  46. [46]

    M. Van Kempen, et al., Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42(2), 243–246 (2024)

  47. [47]

    C. Bannwarth, S. Ehlert, S. Grimme, GFN2-xTB—An accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions.J. Chem. Theory Comp.15(3), 1652–1671 (2019)

  48. [48]

    T. Geffner, et al., La-proteina: Atomistic protein generation via partially latent flow matching. arXiv preprint arXiv:2507.09466 (2025)

  49. [49]

    S. L. Lisanza, et al., Multistate and functional protein design using RoseTTAFold sequence space diffusion. Nat. Biotechnol. 43(8), 1288–1298 (2025)

  50. [50]

    T. Hayes, et al., Simulating 500 million years of evolution with a language model. Science 387(6736), 850–858 (2025)

  51. [51]

    R. Krishna, et al., Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384(6693), eadl2528 (2024)

  52. [52]

    W. Kabsch, C. Sander, Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983)

  53. [53]

    F. Neese, The ORCA program system. WIRES Comput. Molec. Sci. 2(1), 73–78 (2012)

  54. [54]

    T. H. Allen, T. Kawamoto, S. Gardner, S. J. Geib, D. P. Curran, N-heterocyclic carbene boryl iodides catalyze insertion reactions of N-heterocyclic carbene boranes and diazoesters. Org. Lett. 19(13), 3680–3683 (2017)

  55. [55]

    M. P. Doyle, D. Van Leusen, W. H. Tamblyn, Efficient alternative catalysts and methods for the synthesis of cyclopropanes from olefins and diazo compounds. Synthesis 1981(10), 787–789 (1981)

  56. [56]

    D. G. Gibson, et al., Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6(5), 343–345 (2009)

  57. [57]

    Y. Long, et al., LevSeq: Rapid generation of sequence-function data for directed evolution and machine learning. ACS Synth. Biol. 14(1), 230–238 (2024)

  58. [58]

    C. E. Boville, et al., Engineered Biosynthesis of β-Alkyl Tryptophan Analogues. Angew. Chem. Int. Ed. 57(45), 14764–14768 (2018)

    Acknowledgments. The authors thank Sabine Brinkmann-Chen, Olexa Bilaniuk, Almer van der Sloot, Luca Ambrogioni, Jessica Wu, Dejan Stančević, Danyal Rehman, Guillaume Huguet, Ella Miray Rajaonson, Fred Zhangzhi Peng, Andrei...

  59. [59]

    $\cdots = \prod_{i=1}^{L} \mathrm{Cat}\big(x_r^i;\ \alpha_r\,\delta(x_0^i) + (1-\alpha_r)\,\delta(m)\big)$ (S5)

  60. [60]

    where $\alpha_r$ describes the noise schedule similarly to $g_t$ in the continuous case, and is specified such that $\alpha_0 = 1$ and $\alpha_1 = 0$. This describes a process where at time $r$ a fraction $\alpha_r$ of the tokens are in their original state and $1-\alpha_r$ of the tokens are in the masked state. The simplest conditional reverse transition kernel for a token $x^i$...

  61. [61]

    $\cdots = \begin{cases} \mathrm{Cat}\big(x_{r-1}^i;\ \delta(x_r^i)\big) & x_r^i \neq m \\ \mathrm{Cat}\Big(x_{r-1}^i;\ \frac{(1-\alpha_{r-1})\,\delta(m) + (\alpha_{r-1}-\alpha_r)\,\delta(x_0^i)}{1-\alpha_r}\Big) & x_r^i = m \end{cases}$ (S6)

    This leads to a natural parameterization for the transitions of a discrete time generative model. We first define a (possibly) time-dependent denoiser network $D_{r,\theta}: V^L \times [0, 1] \to \Delta^L$ that predicts the probabilities of a clean sample $x_0 \sim D_{r,\theta}(x_r)$: $p_{r,\theta}(\ldots$

  62. [62]

    Lack of Order Control. An analytic Gillespie-style sampler [36] shows that the denoising proceeds by uniformly randomly selecting a masked position, offering no control over the generation order of tokens

  63. [63]

    Token Commitment. (S7) reveals that once unmasked, a specific token is committed to and never revisited. This means that if “mistakes” are made in the generation process, they can never be corrected. These two issues have led to a variety of solutions involving alternative reverse-time processes for discrete generation. We use a path planning (P2) approach ...

  64. [64]

    Structure Update: Run with seq_mode = False to compute coordinate updates ($r^{\text{update}}_l$)

  65. [65]

    Sequence Logits: Run with seq_mode = True to predict sequence logits ($x^{\text{logits}}_l$), where in sequence mode we scatter add the atom-level embeddings back to the token level before finally predicting the sequence logits. Finally, the predicted positional update $r^{\text{update}}_l$ is rescaled and combined with the input noisy positions using the following schedule equa...

  66. [66]

    Update its state $x_t$ based on the reverse process using Eq. (S16) (continuous) or Eq. (S15) (discrete)

  67. [67]

    Compute its log-weight update $\log w^k_{t+dt} = \log w^k_t + h_t(x^k)\,dt$

  68. [68]

    Normalize the weights and use them to resample the particles in the batch. The state and weight updates are derived based on the target distribution we wish to sample from. In Section A.7.1 and Section A.7.4, we demonstrate two use cases of FKC steering and write the state and weight updates for each. A.7.1 Feynman-Kac Correctors - Specificity Guidance Th...

  69. [69]

    $\sum_{i<j} \big(D^P_{ij} - D^Q_{\pi(i),\pi(j)}\big)^2$. (v) Motif cluster diversity. With all pairwise Cα-RMSD values between motifs using iterative Hungarian matching with Kabsch alignment, we perform complete-linkage agglomerative clustering with a distance threshold of $t = 2.0$ Å. The cluster ratio is defined as $r = n_{\text{clusters}}/n_{\text{designs}}$, where $r \approx 1$ indicates that nearly every desi...

  70. [70]

    In both rounds, the protein was co-folded with the reactive intermediate to assess active-site geometry. AlphaFold 3 predictions were additionally performed with only the cofactor (omitting the intermediate). A design was retained only if it was co-designable under both Chai-1 and AlphaFold 3. Dual-oracle passing designs were subjected to the following filters:

  71. [71]

    Contacts. The number of protein heavy atoms within 4.0 Å of any intermediate heavy atom was required to be ≥ 5

  72. [72]

    Active-site burial. Three complementary metrics assessed ligand burial. First, the SASA burial fraction, $f_{\text{burial}} = 1 - \frac{A_{\text{complex}}}{A_{\text{free}}}$, (S55) was required to exceed 0.5. Second, we computed a solid-angle enclosure score. From each of up to 15 ligand heavy-atom origins (selected by farthest-point sampling), $n = 500$ rays were cast uniformly over $S^2$ via a Fibonacc...

  73. [73]

    Metal coordination. For metalloenzyme campaigns, proper coordination geometry of catalytic metal ions was verified (correct number and identity of coordinating residues, bond lengths within ±0.5 Å of ideal values)

  74. [74]

    Surface hydrophobicity. The fraction of solvent-exposed hydrophobic residues (Ala, Val, Ile, Leu, Met, Phe, Trp, Pro) was required to be < 50% to reduce aggregation propensity. 5. Net charge. The formal net charge at pH 7 was checked to be within ±15. Designs were then ranked within each cluster by a simple composite confidence score: S = 0.5·ipTM + 0.25·pTM − 0.5·PA...
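The filter thresholds that survive intact in the fragments above (at least 5 contacts within 4.0 Å, SASA burial fraction above 0.5, exposed hydrophobic fraction below 50%, net charge within ±15) can be sketched as a simple pass/fail check. This is an illustrative reading, not the authors' pipeline. The function names are hypothetical, inputs such as SASA values and net charge are assumed to be computed elsewhere, and "fraction of solvent-exposed hydrophobic residues" is interpreted here as the hydrophobic share of the exposed residues. The enclosure score, metal-coordination check, and composite ranking score are omitted because their definitions are truncated in the extraction.

```python
import math

HYDROPHOBIC = set("AVILMFWP")  # Ala, Val, Ile, Leu, Met, Phe, Trp, Pro


def count_contacts(protein_xyz, ligand_xyz, cutoff=4.0):
    """Count protein heavy atoms within `cutoff` angstroms of any ligand heavy atom."""
    return sum(
        1
        for p in protein_xyz
        if any(math.dist(p, q) <= cutoff for q in ligand_xyz)
    )


def burial_fraction(sasa_complex, sasa_free):
    """f_burial = 1 - A_complex / A_free, as in Eq. (S55) of the fragment."""
    return 1.0 - sasa_complex / sasa_free


def exposed_hydrophobic_fraction(exposed_residues):
    """Share of solvent-exposed residues (one-letter codes) that are hydrophobic."""
    if not exposed_residues:
        return 0.0
    return sum(res in HYDROPHOBIC for res in exposed_residues) / len(exposed_residues)


def passes_filters(n_contacts, f_burial, hydrophobic_fraction, net_charge):
    """Apply the four thresholds quoted in the extracted fragments."""
    return (
        n_contacts >= 5
        and f_burial > 0.5
        and hydrophobic_fraction < 0.5
        and abs(net_charge) <= 15
    )


# Made-up numbers for illustration only.
ok = passes_filters(
    n_contacts=6,
    f_burial=burial_fraction(sasa_complex=25.0, sasa_free=100.0),
    hydrophobic_fraction=exposed_hydrophobic_fraction("AKDE"),
    net_charge=-3,
)
print(ok)  # True
```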