Emyx: Fast and efficient all-atom protein generation

Andrew L. Hopkins; Christian D. Madsen; Constantin Schneider; Edward O. Pyzer-Knapp; Matteo P. Ferla; Nicholas B. Woodall; Nicholas J. Williams; Ruby Sedgwick; Ward Haddadin

arxiv: 2606.19377 · v1 · pith:DL5GAA77new · submitted 2026-06-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 04:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords protein generation · enzyme design · flow matching · all-atom model · generative models · conditional generation · diffusion models

The pith

A 140M-parameter flow matching model generates enzyme scaffolds more successfully than larger diffusion models while training in one-quarter the compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Computational enzyme design needs generative models that place catalytic residues and ligands with high geometric precision while producing diverse, novel structures. Existing all-atom generators borrow heavy architectures from structure prediction and therefore train slowly and sample with limited variety. Emyx concentrates model capacity inside ordinary transformer blocks, uses lightweight conditional inputs instead of rich embeddings, and exactly reparametrises its flow-matching interpolant into the EDM noise-level framework so that state-of-the-art sampling methods become available without retraining. On the AME enzyme benchmark, which demands both global fold recovery and precise catalytic geometry, the model records higher success rates, greater structural novelty, and better scaffold diversity than Proteína-Complexa and RFdiffusion3, yet completes training in 682 GPU-hours, roughly a quarter of RFdiffusion3's training cost.

Core claim

Emyx is a 140M-parameter conditional flow matching model that replaces expensive embedding stacks with lightweight conditional representations and sparse connectivity, then reparametrises the flow-matching interpolant exactly into the EDM framework; this architecture achieves higher success rates on the strict AME enzyme-design benchmark (global fold recovery plus catalytic geometry) together with improved novelty, diversity and validity, while training in 682 GPU-hours.

What carries the argument

Conditional flow matching model that concentrates capacity in standard transformer blocks with lightweight conditional representations and an exact EDM reparametrisation of the interpolant.
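
A small numerical check can make the idea of an exact reparametrisation concrete. The sketch below assumes a linear interpolant x_t = (1 - t) * x_data + t * noise, a common flow-matching convention that this page does not confirm as the one Emyx uses. The function names and the mapping sigma = t / (1 - t) are illustrative assumptions and are not taken from the paper's derivation.

```python
def interpolant(x_data, noise, t):
    """Assumed linear interpolant: t = 0 gives data, t = 1 gives pure noise."""
    return [(1.0 - t) * d + t * n for d, n in zip(x_data, noise)]


def t_to_sigma(t):
    """Noise level implied by rescaling the assumed interpolant by 1 / (1 - t)."""
    return t / (1.0 - t)


def sigma_to_t(sigma):
    """Inverse map from noise level back to flow-matching time."""
    return sigma / (1.0 + sigma)


def edm_noisy_sample(x_data, noise, sigma):
    """EDM-style corruption: clean data plus noise scaled by sigma."""
    return [d + sigma * n for d, n in zip(x_data, noise)]


def max_reparam_error(x_data, noise, t):
    """Largest gap between the rescaled interpolant and the EDM-style sample."""
    x_t = interpolant(x_data, noise, t)
    edm = edm_noisy_sample(x_data, noise, t_to_sigma(t))
    return max(abs(a / (1.0 - t) - b) for a, b in zip(x_t, edm))


x_data = [1.2, -0.7, 3.4]   # toy coordinates
noise = [0.3, -1.1, 0.8]    # fixed stand-in for a Gaussian draw
print(max_reparam_error(x_data, noise, t=0.3))  # zero up to floating-point rounding
```

Under this assumed convention the two corruptions differ only by the known scale factor 1 / (1 - t), which is why a sampler written in terms of sigma could in principle be driven by a model trained in terms of t. Whether Emyx uses this exact mapping has to be checked against the paper's methods section.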

If this is right

Enzyme design campaigns can reach comparable or higher success rates with substantially lower training budgets.
Smaller all-atom generators become competitive once heavy co-evolutionary embedding stacks are removed.
Flow-matching models can adopt diffusion-style samplers without retraining after the EDM reparametrisation.
Training time reductions of this magnitude make iterative model refinement feasible on modest hardware clusters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lightweight conditioning approach could be tested on other sparse-constraint tasks such as motif scaffolding or binder design.
If the reparametrisation preserves exact likelihoods, it may allow direct comparison of flow-matching and diffusion likelihoods on protein data.
Lower training cost could enable larger-scale ablation studies on the relative importance of transformer depth versus conditioning richness.
Success on AME may generalise to multi-ligand or multi-catalytic-site problems if the sparse connectivity pattern scales.

Load-bearing premise

The AME benchmark's strict criteria of global fold recovery plus catalytic geometry accuracy serve as a reliable proxy for real-world enzyme design utility.

What would settle it

A controlled head-to-head experiment in which designs from Emyx, RFdiffusion3 and Proteína-Complexa are expressed, purified and assayed for catalytic activity under identical experimental protocols. If the Emyx designs proved active at rates no higher than those from the baselines, the benchmark advantage would not have carried over to the bench.

Original abstract

Computational enzyme design requires generating proteins that scaffold catalytic residues and ligands, a task that demands both geometric accuracy and structural diversity from the underlying generative model. Current all-atom generators inherit expensive architectures from structure prediction, leading to high training costs and limited sample diversity. We argue that much of this complexity is unnecessary for generators, which condition on sparse geometric constraints rather than rich co-evolutionary signals. Emyx is a 140M-parameter conditional flow matching model that concentrates capacity within standard transformer blocks, replacing heavy embedding stacks with lightweight conditional representations and sparse connectivity. We additionally derive an exact reparametrisation of the flow matching interpolant into the EDM noise-level framework, bridging flow matching training efficiency with state-of-the-art sampling methods designed for diffusion models without retraining. Despite being the smallest model, Emyx outperforms both Prote\'ina-Complexa and RFdiffusion3 against the AME enzyme design benchmark across success rate under strict evaluation requiring both global fold recovery and catalytic geometry accuracy, structural novelty, scaffold diversity, and geometric validity, while training in just $682$ GPU-hours, roughly $4\times$ less than RFdiffusion3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Emyx is a smaller flow-matching model that claims benchmark wins and 4x lower training cost via an exact EDM reparametrization and lightweight conditionals, but the numbers need checking.

Emyx presents a 140M-parameter conditional flow matching model for all-atom protein generation aimed at enzyme design. The main technical moves are an exact reparametrization of the flow matching interpolant into the EDM framework and the replacement of heavy embedding stacks with lightweight conditional representations plus sparse connectivity inside standard transformer blocks.

Those choices look like a reasonable attempt to strip out complexity that structure-prediction models carry but generators may not need. The reported training cost of 682 GPU-hours is a concrete figure, and the claim of roughly 4x savings relative to RFdiffusion3 is the sort of empirical detail that can be tested directly. If the outperformance on the strict AME criteria (global fold recovery plus catalytic geometry) holds with proper controls, the efficiency angle would be useful for groups that cannot afford the larger models.

The soft spots are straightforward. The abstract supplies no success rates, error bars, ablation results, or data-split details, so the central performance claim cannot be evaluated from the summary alone. The reparametrization is described as exact, which is good if the derivation is correct, but that still needs verification in the methods. The assumption that the AME benchmark with its combined fold-and-geometry requirements is a strong proxy for practical enzyme utility is plausible yet unproven, and fair comparison to the baselines requires confirming that evaluation protocols match.

This paper is for researchers working on generative models for proteins who care about training efficiency and sparse conditioning. A reader already familiar with flow matching and EDM would get the most from the architectural specifics.

It deserves peer review so the empirical results and the reparametrization can be examined in detail.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Emyx, a 140M-parameter conditional flow-matching model for all-atom protein generation focused on enzyme design. It claims an exact reparametrization of the flow-matching interpolant into the EDM noise-level framework, enabling use of advanced sampling methods without retraining, and reports that this smallest model outperforms Proteína-Complexa and RFdiffusion3 on the AME benchmark under strict criteria requiring global fold recovery plus catalytic geometry accuracy, while also improving structural novelty, scaffold diversity, and geometric validity, all at a training cost of 682 GPU-hours (roughly 4× less than RFdiffusion3).

Significance. If the reported empirical results prove robust, the work would indicate that concentrating capacity in standard transformer blocks with lightweight conditional representations and sparse connectivity can yield both higher performance and substantially lower training costs than larger inherited architectures from structure prediction, offering a practical route to more accessible all-atom generative models for computational enzyme design.

major comments (2)

[Abstract] The central claims of outperformance on the AME benchmark (success rate, novelty, diversity, validity) and reduced training cost are stated without any accompanying quantitative results, tables of metrics, error bars, statistical tests, ablation studies, or details on data splits and evaluation protocol. This absence is load-bearing for the primary empirical claim and prevents verification of whether the results support the stated superiority.
[Abstract] The reparametrization is described as 'exact' and bridging flow matching with EDM sampling methods, yet no derivation, equations, or proof of exactness is supplied in the provided text, leaving the technical foundation for the claimed efficiency gains unverified.

minor comments (1)

[Abstract] The baseline name appears as 'Prote\'ina-Complexa'; confirm correct spelling and formatting for 'Proteína-Complexa'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and verifiability.

Referee: [Abstract] The central claims of outperformance on the AME benchmark (success rate, novelty, diversity, validity) and reduced training cost are stated without any accompanying quantitative results, tables of metrics, error bars, statistical tests, ablation studies, or details on data splits and evaluation protocol. This absence is load-bearing for the primary empirical claim and prevents verification of whether the results support the stated superiority.

Authors: We agree that the abstract would be strengthened by including specific quantitative results. The full manuscript already contains detailed tables with success rates (under the strict global fold + catalytic geometry criteria), novelty, diversity, and validity metrics, along with error bars, statistical comparisons, data splits, and the full evaluation protocol in the results and methods sections. To make the primary claims immediately verifiable from the abstract, we will revise it to report key numbers such as the success rates for Emyx versus the baselines and the precise training cost comparison.

revision: yes

Referee: [Abstract] The reparametrization is described as 'exact' and bridging flow matching with EDM sampling methods, yet no derivation, equations, or proof of exactness is supplied in the provided text, leaving the technical foundation for the claimed efficiency gains unverified.

Authors: The manuscript derives the exact reparametrization in the methods section, showing the mathematical equivalence of the flow-matching interpolant to the EDM noise-level framework. However, we acknowledge that a self-contained derivation with equations and a brief proof of exactness would improve accessibility. We will add this derivation (including the key equations) to the main text or a new subsection in the revision.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central claims rest on an empirical benchmark comparison and a stated exact reparametrization of the flow-matching interpolant to the EDM framework. No load-bearing derivation reduces by construction to fitted inputs, self-citations, or ansatzes imported from prior author work. The reparametrization is presented as mathematically exact rather than a fitted or renamed result, and performance metrics are reported as direct experimental outcomes on the AME benchmark, not quantities forced by internal parameter choices. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard generative modeling assumptions rather than new physical postulates; no free parameters are explicitly fitted beyond the trained model weights, and no new entities are introduced.

axioms (1)

domain assumption: Conditional flow matching on sparse geometric constraints is sufficient for all-atom protein generation without rich co-evolutionary signals.
Invoked in the model design and in the comparison to structure-prediction architectures.

Reference graph

Works this paper leans on

58 extracted references · 12 canonical work pages

[1] Frances H. Arnold. Design by directed evolution. Accounts of Chemical Research, 31(3):125–131, 1998. doi: 10.1021/ar960017f
[2] Dina Listov and Sarel J. Fleishman. De novo enzyme design: Controlling structure to design function. Current Opinion in Structural Biology, 98:103252, 2026. ISSN 0959-440X. doi: 10.1016/j.sbi.2026.103252
[3] Andy Hsien-Wei Yeh, Christoffer Norn, Yakov Kipnis, Doug Tischer, Samuel J. Pellock, Declan Evans, Pengchen Ma, Gyu Rie Lee, Jason Z. Zhang, Ivan Anishchenko, Brian Coventry, Longxing Cao, Justas Dauparas, Samer Halabiya, Michelle DeWitt, Lauren Carter, K. N. Houk, and David Baker. De novo design of luciferases using deep learning. Nature, 614(7949):774–78... 2023. doi: 10.1038/s41586-023-05696-3
[4] Anna Lauko, Samuel J Pellock, Kiera H Sumida, Ivan Anishchenko, David Juergens, Woody Ahern, Jihun Jeung, Alexander F Shida, Andrew Hunt, Indrek Kalvet, et al. Computational design of serine hydrolases. Science, 388(6744):eadu2454, 2025
[5] Donghyo Kim, Seth M Woodbury, Woody Ahern, Doug Tischer, Alex Kang, Emily Joyce, Asim K Bera, Nikita Hanikel, Saman Salike, Rohith Krishna, et al. Computational design of metallohydrolases. Nature, 649(8095):246–253, 2026
[6] Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 620:1089–1100, 2023
[7] Rohith Krishna, Jue Wang, Woody Ahern, Pascal Sturmfels, Preetham Venkatesh, Indrek Kalvet, Gyu Rie Lee, Felix S Morey-Burrows, Ivan Anishchenko, Ian R Humphreys, Ryan McHugh, Dionne Vafeados, Xinting Li, George A Sutherland, Andrew Hitchcock, C Neil Hunter, Alex Kang, Evans Brackenbrough, Asim K Bera, Minkyung Baek, Frank DiMaio, and David Baker. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science, 384(6693):eadl2528, April 2024
[8] Woody Ahern, Jason Yim, Doug Tischer, Saman Salike, Seth M Woodbury, Donghyo Kim, Indrek Kalvet, Yakov Kipnis, Brian Coventry, Han Raut Altae-Tran, Magnus S Bauer, Regina Barzilay, Tommi S Jaakkola, Rohith Krishna, and David Baker. Atom-level enzyme active site scaffolding using RFdiffusion2. Nat. Methods, 23(1):96–105, January 2026
[9] Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, and Karsten Kreis. Scaling atomistic protein binder design with generative pretraining and test-time compute. In The Fourteenth International Conference ... 2026
[10] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Alexis Courbet, Rob J. de Haas, Neville Bethel, Philip J. Y. Leung, Timothy F. Huddy, Samuel Pellock, Doug Tischer, Frank Chan, Brian Koepnick, Huong Nguyen, Alex Kang, Banumathi Sankaran, Asim K. Bera, Neil P. King, and David Baker. Robu... 2022. doi: 10.1126/science.add2187
[11] Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, and David Baker. Atomic context-conditioned protein sequence design using LigandMPNN. Nature Methods, 22:717–723,
[12] doi: 10.1038/s41592-025-02626-1
[13] Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, and Arash Vahdat. La-Proteina: Atomistic protein generation via partially latent flow matching. arXiv preprint arXiv:2507.09466, 2025
[14] Jasper Butcher, Rohith Krishna, Raktim Mitra, Rafael I Brent, Yanjing Li, Nathaniel Corley, Paul T Kim, Jonathan Funk, Simon Mathis, Saman Salike, et al. De novo design of all-atom biomolecular interactions with RFdiffusion3. bioRxiv, 2025. doi: 10.1101/2025.09.18.676967
[15] Hannes Stark, Felix Faltings, MinGyu Choi, Yuxin Xie, Eunsu Hur, Timothy O’Donnell, Anton Bushuiev, Talip Ucar, Saro Passaro, Weian Mao, Mateo Reveiz, Roman Bushuiev, et al. BoltzGen: Toward universal binder design. bioRxiv, 2025. doi: 10.1101/2025.11.20.689494
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021
[17] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 2021
[18] Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Noah Getz, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Liam Atkinson, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, and Regina Barzilay. Boltz-1 democratizing biomolecular interaction modeling. May 2025
[19] Chai Discovery, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. bioRxiv, 2024. doi: 10.1101/2024.10.10.615955. URL https://www.biorxiv.org/content/10.1101/2024.10.10.615955v1
[20] Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J. Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016):493–500, 2024. doi: 10.1038/s41586-024-07487-w
[21] Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, and Regina Barzilay. Boltz-2: Towards accurate and efficient binding affinity prediction. bioRxiv, 2025. doi: 10.1101/2025.06.14.659707
[22] Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, and Karsten Kreis. Proteina: Scaling flow-based protein structure generative models. In International Conference on Learning Representations, 2025
[23] Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Joshua M. Susskind, and Miguel Ángel Bautista. Simplefold: Folding proteins is simpler than you think. In The Fourteenth International Conference on Learning Representations,
[24] URL https://openreview.net/forum?id=0j0MmK7EMA
[25] William Peebles and Saining Xie. Scalable diffusion models with transformers. In International Conference on Computer Vision, 2023
[26] Roshan M. Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. MSA transformer. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8844–8856. PMLR, 2021
[27] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. doi: 10.1126/science.ade2574
[28] Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul S. Molina, Neil Thomas, Yousuf A. Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Matthew Nemeth, Patrick D. Hsu, Tom Sercu, Salvatore Cand... 2025. doi: 10.1126/science.ads0018
[29] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022
[30] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations, 2023
[31] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023
[32] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, 2024
[33] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020
[34] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science, pages 646–661. Springer International Publishing, Cham, 2016
[35] Valerio Mariani, Marco Biasini, Alessandro Barbato, Florian Kiefer, and Torsten Schwede. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21):2722–2728, 2013
[36] Michel van Kempen, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron L.M. Gilchrist, Johannes Söding, and Martin Steinegger. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, 42:243–246, 2024
[37] Matthew Z Tien, Dariya K Sydykova, Austin G Meyer, and Claus O Wilke. PeptideBuilder: A simple python library to generate model peptides. PeerJ, 1:e80, May 2013
[38] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Foundations of Crystallography, 32(5):922–923, 1976
[39] Robbie P. Joosten, Fei Long, Garib N. Murshudov, and Anastassis Perrakis. The PDB_REDO server for macromolecular structure model optimization. IUCrJ, 1(4):213–220, 2014
[40] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000
[41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019
[42] Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209):1–80, 2025
[43] Saurabh Singh and Ian Fischer. Stochastic sampling from deterministic flow models. arXiv preprint arXiv:2410.02217, 2024
[44] Yang Zhang and Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004
[45] Charles H. Martin, Tongsu Peng, and Michael W. Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications, 12:4639, 2021
[46] Chemical bonds: all atom-atom and token-token bonds from the structure. Since each ligand atom is its own token, ligand atom-atom bonds directly become token-token edges at the token level, preserving ligand connectivity in both graphs
[47] Ligand k-NN: the k_lig nearest HETATM neighbours of each HETATM atom, ensuring ligand-internal connectivity is maintained throughout the generation
[48] Motif connectivity (token level only): edges from every non-motif token to every motif token. This connectivity ensures that the scaffold always has direct access to the motif geometry. 5. k-NN fill: the remaining budget per node (atom or token) is filled by nearest neighbours in Euclidean distance. B Training details. B.1 Base distribution. The base distrib...
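
The k-NN fill step in the connectivity list above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `knn_fill`, the dictionary-of-sets edge layout, and the treatment of edges as directed are all hypothetical choices, not details confirmed by the extracted text.

```python
import math


def knn_fill(coords, edges, budget):
    """Top up each node's neighbour set to `budget` with Euclidean nearest neighbours.

    coords: list of (x, y, z) positions, one per node (atom or token).
    edges:  dict mapping node index -> set of neighbours already assigned
            (for example by chemical bonds or motif connectivity).
    Returns a new dict with one neighbour set per node.
    """
    filled = {}
    for i, point in enumerate(coords):
        neighbours = set(edges.get(i, ()))
        # Candidates ordered by distance from node i, nearest first.
        by_distance = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(point, coords[j]),
        )
        for j in by_distance:
            if len(neighbours) >= budget:
                break
            neighbours.add(j)  # no effect if j is already a neighbour
        filled[i] = neighbours
    return filled


# Four nodes on a line; node 0 already has a long-range edge to node 3.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
graph = knn_fill(coords, {0: {3}}, budget=2)
```

In this toy case node 0 keeps its pre-assigned edge to node 3 and spends the one remaining slot on its nearest neighbour, node 1.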
[49] Spatial cropping: Structures are cropped to a maximum token budget. A seed atom (from the HETATMs if available) is selected, and whole residues or whole ligands are greedily added by increasing distance from the seed. Unlike other models which crop based on spatial and sequence distance with equal probability, we only crop by spatial distance. We suspect ...
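
The greedy cropping rule described above might look like the sketch below. Several details are assumptions made only for illustration: the `(group_id, n_tokens, atom_coords)` layout, measuring a group's distance by its closest atom, and stopping at the first group that no longer fits the budget.

```python
import math


def spatial_crop(groups, seed, max_tokens):
    """Greedy spatial crop around a seed atom.

    groups: list of (group_id, n_tokens, atom_coords), one entry per whole
            residue or whole ligand; atom_coords is a list of (x, y, z).
    seed:   (x, y, z) position of the seed atom.
    Returns the ids of the kept groups, nearest first.
    """
    def distance_to_seed(group):
        _, _, atom_coords = group
        # Assumption: a group's distance is that of its closest atom.
        return min(math.dist(seed, atom) for atom in atom_coords)

    kept, used = [], 0
    for group_id, n_tokens, _ in sorted(groups, key=distance_to_seed):
        if used + n_tokens > max_tokens:
            break  # Assumption: stop at the first group that does not fit.
        kept.append(group_id)
        used += n_tokens
    return kept


# One 3-atom ligand (one token per atom) and three single-token residues.
groups = [
    ("A1", 1, [(0.0, 0.0, 0.0)]),
    ("LIG", 3, [(1.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 0.0, 0.0)]),
    ("A2", 1, [(5.0, 0.0, 0.0)]),
    ("A3", 1, [(9.0, 0.0, 0.0)]),
]
cropped = spatial_crop(groups, seed=(1.0, 0.0, 0.0), max_tokens=5)
```

Because groups are added whole, the crop never splits a residue or a ligand, which matches the "whole residues or whole ligands" wording.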
[50] Motif sampling: With some probability, between 2 and 20 residues are selected as a frozen motif. With equal probability, motif residues are sampled as contiguous 3-residue segments along the chain or as random individual residues. For each selected residue, the atoms included in the motif are determined by one of two modes: (a) all mode, which includes all ...
[51] HETATM freezing: With some probability, we freeze HETATM atoms (ligands, cofactors, metals) alongside the motif residues as part of the motif condition
[52] Conditional feature masking: Features are stochastically masked during training to allow for both unconditional and conditional inference. For flowing (non-motif) atoms, all features apart from the atom index in the Rep14 representation (§A.1) are always masked. For motif residues, features are only masked half of the time. Global features are also each i...
[53] SE(3) augmentation: All structures are initially centred at the origin and then randomly rotated and translated so that the model learns 3D equivariance
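
A minimal sketch of the SE(3) augmentation step above is given here, using only the Python standard library. The quaternion-based rotation sampler and the `translation_std` parameter are assumptions chosen for illustration; the extracted text does not say how rotations or translations are drawn.

```python
import math
import random


def random_rotation(rng):
    """Uniformly random 3x3 rotation matrix from a normalised Gaussian quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def se3_augment(coords, rng, translation_std=1.0):
    """Centre coords at the origin, then apply a random rotation and translation."""
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    centred = [[p[i] - centroid[i] for i in range(3)] for p in coords]
    rot = random_rotation(rng)
    # Hypothetical choice: isotropic Gaussian shift with the given standard deviation.
    shift = [rng.gauss(0.0, translation_std) for _ in range(3)]
    return [
        [sum(rot[i][j] * p[j] for j in range(3)) + shift[i] for i in range(3)]
        for p in centred
    ]


rng = random.Random(0)
structure = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.8]]
augmented = se3_augment(structure, rng)
```

A rigid transform leaves all pairwise distances unchanged, which is a convenient sanity check when wiring an augmentation like this into a data pipeline.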
[54] Coordinate scaling: The noise and data coordinates are scaled down by σ_data = 10 before being passed to the model for numerical and training stability. B.5 Dataset preparation. Sources and filtering. Starting protein structures were obtained from (i) PDB-REDO [37], which provides re-refined and rebuilt X-ray crystallographic models, and (ii) the RCSB Protein ...
[55] Backbone generation. For each motif target, Emyx generates 200 backbone structures using the EDM sampler (Algorithm 1) with 200 integration steps.
[56] Inverse folding. LigandMPNN [11] redesigns the sequence for each generated backbone, producing 8 candidate sequences per structure at a sampling temperature of 0.1. The ligand and motif residue identities are held fixed; only scaffold positions are redesigned. (Footnote 9) Active sites are derived from the Mechanism and Catalytic Site Atlas (M-CSA) cross-referenced wit...
[57] Structure prediction. Each candidate sequence is folded 5 times by Boltz-2 [20] with MSA input enabled. The prediction with the highest pTM score is retained as the representative fold for that sequence
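
The counts quoted in the generation, inverse-folding and structure-prediction steps above imply a fixed number of folding runs per motif target, and the "keep the highest pTM" rule is a one-line selection. The sketch below works through both. The dictionary layout for a prediction and the function name are hypothetical; no Boltz-2 or LigandMPNN API is called.

```python
BACKBONES_PER_TARGET = 200     # backbones generated per motif target
SEQUENCES_PER_BACKBONE = 8     # candidate sequences per backbone
FOLDS_PER_SEQUENCE = 5         # repeated folds per candidate sequence

sequences_per_target = BACKBONES_PER_TARGET * SEQUENCES_PER_BACKBONE
folds_per_target = sequences_per_target * FOLDS_PER_SEQUENCE


def pick_representative(predictions):
    """Return the prediction with the highest pTM score.

    predictions: list of dicts with at least a 'ptm' key (hypothetical layout).
    """
    return max(predictions, key=lambda p: p["ptm"])


# Toy pTM values for the five folds of one candidate sequence.
folds = [
    {"fold": 0, "ptm": 0.71},
    {"fold": 1, "ptm": 0.64},
    {"fold": 2, "ptm": 0.81},
    {"fold": 3, "ptm": 0.77},
    {"fold": 4, "ptm": 0.58},
]
best = pick_representative(folds)
print(sequences_per_target, folds_per_target, best["fold"])
```

The multiplication makes the evaluation cost explicit: under the stated settings, each motif target requires 1,600 sequences and 8,000 folding runs before any success criterion is applied.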
[58]
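The selection rule is a per-sequence arg-max over the repeated predictions. The sketch below assumes each prediction is a dict with a `ptm` field; that record layout and the function name are hypothetical and do not reflect Boltz-2's actual output format.

```python
def pick_representatives(predictions_by_sequence):
    """Keep, for every sequence, the prediction with the highest pTM.

    predictions_by_sequence: dict mapping a sequence identifier to a list of
    prediction records, each a dict with a 'ptm' key (hypothetical layout).
    """
    representatives = {}
    for seq_id, predictions in predictions_by_sequence.items():
        if not predictions:
            raise ValueError(f"no predictions for sequence {seq_id!r}")
        representatives[seq_id] = max(predictions, key=lambda p: p["ptm"])
    return representatives
```

With 200 backbones, 8 sequences per backbone and 5 folds per sequence, this reduces 8,000 predictions per motif target to 1,600 representative folds.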

Success criteria. As discussed in §3.1, we evaluate three approaches to the success criteria that differ in (i) how the designed and re-predicted structures are aligned and (ii) which RMSD values are checked. All three additionally require no ligand clash: every inter-atomic distance between HETATM (ligand) atoms and protein atoms in the generated structure exceeds 1.5 Å ...
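The no-clash condition can be checked directly from coordinates. This is a minimal sketch assuming ligand and protein atoms are given as lists of (x, y, z) positions in Å; the function name and input layout are illustrative, not the paper's evaluation code.

```python
import math


def has_no_ligand_clash(ligand_xyz, protein_xyz, min_distance=1.5):
    """Return True if every ligand-protein atom pair is farther apart than min_distance (Å)."""
    return all(
        math.dist(lig_atom, prot_atom) > min_distance
        for lig_atom in ligand_xyz
        for prot_atom in protein_xyz
    )
```

The all-pairs loop is quadratic in the number of atoms; a production pipeline would more likely use a spatial index or vectorised distances, but the pass/fail outcome is the same.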

[1] Frances H. Arnold. Design by directed evolution. Accounts of Chemical Research, 31(3):125–131, 1998. doi: 10.1021/ar960017f

[2] Dina Listov and Sarel J. Fleishman. De novo enzyme design: Controlling structure to design function. Current Opinion in Structural Biology, 98:103252, 2026. ISSN 0959-440X. doi: 10.1016/j.sbi.2026.103252

[3] Andy Hsien-Wei Yeh, Christoffer Norn, Yakov Kipnis, Doug Tischer, Samuel J. Pellock, Declan Evans, Pengchen Ma, Gyu Rie Lee, Jason Z. Zhang, Ivan Anishchenko, Brian Coventry, Longxing Cao, Justas Dauparas, Samer Halabiya, Michelle DeWitt, Lauren Carter, K. N. Houk, and David Baker. De novo design of luciferases using deep learning. Nature, 614(7949):774–78... doi: 10.1038/s41586-023-05696-3

[4] Anna Lauko, Samuel J Pellock, Kiera H Sumida, Ivan Anishchenko, David Juergens, Woody Ahern, Jihun Jeung, Alexander F Shida, Andrew Hunt, Indrek Kalvet, et al. Computational design of serine hydrolases. Science, 388(6744):eadu2454, 2025

[5] Donghyo Kim, Seth M Woodbury, Woody Ahern, Doug Tischer, Alex Kang, Emily Joyce, Asim K Bera, Nikita Hanikel, Saman Salike, Rohith Krishna, et al. Computational design of metallohydrolases. Nature, 649(8095):246–253, 2026

[6] Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 620:1089–1100, 2023

[7] Rohith Krishna, Jue Wang, Woody Ahern, Pascal Sturmfels, Preetham Venkatesh, Indrek Kalvet, Gyu Rie Lee, Felix S Morey-Burrows, Ivan Anishchenko, Ian R Humphreys, Ryan McHugh, Dionne Vafeados, Xinting Li, George A Sutherland, Andrew Hitchcock, C Neil Hunter, Alex Kang, Evans Brackenbrough, Asim K Bera, Minkyung Baek, Frank DiMaio, and David Baker. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science, 384(6693):eadl2528, April 2024

[8] Woody Ahern, Jason Yim, Doug Tischer, Saman Salike, Seth M Woodbury, Donghyo Kim, Indrek Kalvet, Yakov Kipnis, Brian Coventry, Han Raut Altae-Tran, Magnus S Bauer, Regina Barzilay, Tommi S Jaakkola, Rohith Krishna, and David Baker. Atom-level enzyme active site scaffolding using RFdiffusion2. Nat. Methods, 23(1):96–105, January 2026

[9] Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, and Karsten Kreis. Scaling atomistic protein binder design with generative pretraining and test-time compute. In The Fourteenth International Conference ..., 2026

[10] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Alexis Courbet, Rob J. de Haas, Neville Bethel, Philip J. Y. Leung, Timothy F. Huddy, Samuel Pellock, Doug Tischer, Frank Chan, Brian Koepnick, Huong Nguyen, Alex Kang, Banumathi Sankaran, Asim K. Bera, Neil P. King, and David Baker. Robu... doi: 10.1126/science.add2187, 2022

[11] Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, and David Baker. Atomic context-conditioned protein sequence design using LigandMPNN. Nature Methods, 22:717–723, 2025. doi: 10.1038/s41592-025-02626-1

[13] Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, and Arash Vahdat. La-Proteina: Atomistic protein generation via partially latent flow matching. arXiv preprint arXiv:2507.09466, 2025

[14] Jasper Butcher, Rohith Krishna, Raktim Mitra, Rafael I Brent, Yanjing Li, Nathaniel Corley, Paul T Kim, Jonathan Funk, Simon Mathis, Saman Salike, et al. De novo design of all-atom biomolecular interactions with RFdiffusion3. bioRxiv, 2025. doi: 10.1101/2025.09.18.676967

[15] Hannes Stark, Felix Faltings, MinGyu Choi, Yuxin Xie, Eunsu Hur, Timothy O’Donnell, Anton Bushuiev, Talip Ucar, Saro Passaro, Weian Mao, Mateo Reveiz, Roman Bushuiev, et al. BoltzGen: Toward universal binder design. bioRxiv, 2025. doi: 10.1101/2025.11.20.689494

[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

[17] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 2021

[18] Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Noah Getz, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Liam Atkinson, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, and Regina Barzilay. Boltz-1 democratizing biomolecular interaction modeling. May 2025

[19] Chai Discovery, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. bioRxiv, 2024. doi: 10.1101/2024.10.10.615955. URL https://www.biorxiv.org/content/10.1101/2024.10.10.615955v1

[20] Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J. Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016):493–500, 2024. doi: 10.1038/s41586-024-07487-w

[21] Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, and Regina Barzilay. Boltz-2: Towards accurate and efficient binding affinity prediction. bioRxiv, 2025. doi: 10.1101/2025.06.14.659707

[22] Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, and Karsten Kreis. Proteina: Scaling flow-based protein structure generative models. In International Conference on Learning Representations, 2025

[23] Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Joshua M. Susskind, and Miguel Ángel Bautista. SimpleFold: Folding proteins is simpler than you think. In The Fourteenth International Conference on Learning Representations. URL https://openreview.net/forum?id=0j0MmK7EMA

[25] William Peebles and Saining Xie. Scalable diffusion models with transformers. In International Conference on Computer Vision, 2023

[26] Roshan M. Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. MSA transformer. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8844–8856. PMLR, 2021

[27] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. doi: 10.1126/science.ade2574

[28] Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul S. Molina, Neil Thomas, Yousuf A. Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Matthew Nemeth, Patrick D. Hsu, Tom Sercu, Salvatore Cand... doi: 10.1126/science.ads0018, 2025

[29] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022

[30] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations, 2023

[31] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023

[32] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, 2024

[33] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

[34] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science, pages 646–661. Springer International Publishing, Cham, 2016

[35] Valerio Mariani, Marco Biasini, Alessandro Barbato, Florian Kiefer, and Torsten Schwede. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21):2722–2728, 2013

[36] Michel van Kempen, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron L.M. Gilchrist, Johannes Söding, and Martin Steinegger. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, 42:243–246, 2024

[37] Matthew Z Tien, Dariya K Sydykova, Austin G Meyer, and Claus O Wilke. PeptideBuilder: A simple python library to generate model peptides. PeerJ, 1:e80, May 2013

[38] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Foundations of Crystallography, 32(5):922–923, 1976

[39] Robbie P. Joosten, Fei Long, Garib N. Murshudov, and Anastassis Perrakis. The PDB_REDO server for macromolecular structure model optimization. IUCrJ, 1(4):213–220, 2014

[40] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000

[41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

[42] Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209):1–80, 2025

[43] Saurabh Singh and Ian Fischer. Stochastic sampling from deterministic flow models. arXiv preprint arXiv:2410.02217, 2024

[44] Yang Zhang and Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004

[45] Charles H. Martin, Tongsu Peng, and Michael W. Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications, 12:4639, 2021

Supplementary Information

A Protein representation

A.1 Rep14 tokenisation ...

Chemical bonds: all atom-atom and token-token bonds from the structure. Since each ligand atom is its own token, ligand atom-atom bonds directly become token-token edges at the token level, preserving ligand connectivity in both graphs.

Ligand k-NN: the k_lig nearest HETATM neighbours of each HETATM atom, ensuring ligand-internal connectivity is maintained throughout the generation.

Motif connectivity (token level only): edges from every non-motif token to every motif token. This connectivity ensures that the scaffold always has direct access to the motif geometry.

5. k-NN fill: the remaining budget per node (atom or token) is filled by nearest neighbours in Euclidean distance.

B Training details

B.1 Base distribution

The base distrib...

Spatial cropping: Structures are cropped to a maximum token budget. A seed atom (from the HETATMs if available) is selected, and whole residues or whole ligands are greedily added by increasing distance from the seed. Unlike other models which crop based on spatial and sequence distance with equal probability, we only crop by spatial distance. We suspect ...
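The greedy cropping step can be sketched as follows. The unit record layout (`name`, `is_hetatm`, `n_tokens`, `atoms`), the use of the minimum atom distance to rank a unit, and stopping at the first unit that would exceed the budget are all assumptions made for illustration; the source only states that whole residues or ligands are added by increasing distance from the seed.

```python
import math
import random


def spatial_crop(units, max_tokens, rng):
    """Greedily keep whole residues/ligands nearest to a seed atom.

    units: list of dicts with 'name', 'is_hetatm', 'n_tokens' and 'atoms'
           (a list of (x, y, z) tuples); a hypothetical layout.
    The seed atom is drawn from HETATM atoms when any exist.
    """
    hetatm_atoms = [a for u in units if u["is_hetatm"] for a in u["atoms"]]
    all_atoms = [a for u in units for a in u["atoms"]]
    seed = rng.choice(hetatm_atoms or all_atoms)

    # Rank each unit by its closest atom to the seed (assumed distance rule).
    ranked = sorted(units, key=lambda u: min(math.dist(seed, a) for a in u["atoms"]))

    kept, used = [], 0
    for unit in ranked:
        if used + unit["n_tokens"] > max_tokens:
            break  # never split a residue or ligand across the crop boundary
        kept.append(unit)
        used += unit["n_tokens"]
    return kept
```

Seeding on a ligand atom keeps the binding pocket inside the crop, which fits the emphasis on catalytic geometry.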

Motif sampling: With some probability, between 2 and 20 residues are selected as a frozen motif. With equal probability, motif residues are sampled as contiguous 3-residue segments along the chain or as random individual residues. For each selected residue, the atoms included in the motif are determined by one of two modes: (a) all mode, which includes all ...
