Emyx: Fast and efficient all-atom protein generation
Pith reviewed 2026-06-27 04:48 UTC · model grok-4.3
The pith
A 140M-parameter flow matching model generates enzyme scaffolds more successfully than larger all-atom generators while training in roughly one-quarter of the compute used by RFdiffusion3.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Emyx is a 140M-parameter conditional flow matching model that replaces expensive embedding stacks with lightweight conditional representations and sparse connectivity, then reparametrises the flow-matching interpolant exactly into the EDM framework; this architecture achieves higher success rates on the strict AME enzyme-design benchmark (global fold recovery plus catalytic geometry) together with improved novelty, diversity and validity, while training in 682 GPU-hours.
What carries the argument
Conditional flow matching model that concentrates capacity in standard transformer blocks with lightweight conditional representations and an exact EDM reparametrisation of the interpolant.
If this is right
- Enzyme design campaigns can reach comparable or higher success rates with substantially lower training budgets.
- Smaller all-atom generators become competitive once heavy co-evolutionary embedding stacks are removed.
- Flow-matching models can adopt diffusion-style samplers without retraining after the EDM reparametrisation.
- Training time reductions of this magnitude make iterative model refinement feasible on modest hardware clusters.
Where Pith is reading between the lines
- The same lightweight conditioning approach could be tested on other sparse-constraint tasks such as motif scaffolding or binder design.
- If the reparametrisation preserves exact likelihoods, it may allow direct comparison of flow-matching and diffusion likelihoods on protein data.
- Lower training cost could enable larger-scale ablation studies on the relative importance of transformer depth versus conditioning richness.
- Success on AME may generalise to multi-ligand or multi-catalytic-site problems if the sparse connectivity pattern scales.
Load-bearing premise
The AME benchmark's strict criteria of global fold recovery plus catalytic geometry accuracy serve as a reliable proxy for real-world enzyme design utility.
What would settle it
A controlled head-to-head experiment, run under identical experimental protocols, in which Emyx-generated designs are expressed, purified and assayed, and turn out to show catalytic activity at rates no higher than designs from RFdiffusion3 or Proteína-Complexa.
Original abstract
Computational enzyme design requires generating proteins that scaffold catalytic residues and ligands, a task that demands both geometric accuracy and structural diversity from the underlying generative model. Current all-atom generators inherit expensive architectures from structure prediction, leading to high training costs and limited sample diversity. We argue that much of this complexity is unnecessary for generators, which condition on sparse geometric constraints rather than rich co-evolutionary signals. Emyx is a 140M-parameter conditional flow matching model that concentrates capacity within standard transformer blocks, replacing heavy embedding stacks with lightweight conditional representations and sparse connectivity. We additionally derive an exact reparametrisation of the flow matching interpolant into the EDM noise-level framework, bridging flow matching training efficiency with state-of-the-art sampling methods designed for diffusion models without retraining. Despite being the smallest model, Emyx outperforms both Prote\'ina-Complexa and RFdiffusion3 against the AME enzyme design benchmark across success rate under strict evaluation requiring both global fold recovery and catalytic geometry accuracy, structural novelty, scaffold diversity, and geometric validity, while training in just $682$ GPU-hours, roughly $4\times$ less than RFdiffusion3.
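The abstract states the reparametrisation without writing it out. As an illustrative sketch only, assuming the common linear interpolant between data x_0 and Gaussian noise ε (the paper's own time convention, its σ_data scaling and its notation may differ), the identity can be written as follows. The symbols t, σ, \tilde{x}_σ, v_θ and D_θ are notation chosen here, not taken from the paper.

```latex
% Assumed convention (not confirmed by the source): linear flow-matching interpolant
x_t = (1 - t)\,x_0 + t\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1).

% Divide by (1 - t) and define a noise level and a rescaled state:
\sigma(t) = \frac{t}{1 - t}, \qquad \tilde{x}_\sigma = \frac{x_t}{1 - t} = x_0 + \sigma\,\varepsilon.

% This is the EDM noising form, and the map is invertible:
t = \frac{\sigma}{1 + \sigma}, \qquad x_t = \frac{\tilde{x}_\sigma}{1 + \sigma}.

% The velocity target is v = \varepsilon - x_0, hence x_0 = x_t - t\,v,
% so a trained velocity network v_\theta implies a denoiser:
D_\theta(\tilde{x}_\sigma; \sigma) = x_t - t\,v_\theta(x_t, t).
```

Under these assumptions every step is an algebraic identity with no approximation, which is one sense in which such a mapping can be called exact. An EDM-style sampler could then query the implied denoiser at any noise level σ without retraining.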
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Emyx, a 140M-parameter conditional flow-matching model for all-atom protein generation focused on enzyme design. It claims an exact reparametrization of the flow-matching interpolant into the EDM noise-level framework, enabling use of advanced sampling methods without retraining, and reports that this smallest model outperforms Proteína-Complexa and RFdiffusion3 on the AME benchmark under strict criteria requiring global fold recovery plus catalytic geometry accuracy, while also improving structural novelty, scaffold diversity, and geometric validity, all at a training cost of 682 GPU-hours (roughly 4× less than RFdiffusion3).
Significance. If the reported empirical results prove robust, the work would indicate that concentrating capacity in standard transformer blocks with lightweight conditional representations and sparse connectivity can yield both higher performance and substantially lower training costs than larger inherited architectures from structure prediction, offering a practical route to more accessible all-atom generative models for computational enzyme design.
major comments (2)
- [Abstract] Abstract: The central claims of outperformance on the AME benchmark (success rate, novelty, diversity, validity) and reduced training cost are stated without any accompanying quantitative results, tables of metrics, error bars, statistical tests, ablation studies, or details on data splits and evaluation protocol. This absence is load-bearing for the primary empirical claim and prevents verification of whether the results support the stated superiority.
- [Abstract] Abstract: The reparametrization is described as 'exact' and bridging flow matching with EDM sampling methods, yet no derivation, equations, or proof of exactness is supplied in the provided text, leaving the technical foundation for the claimed efficiency gains unverified.
minor comments (1)
- [Abstract] Abstract: The baseline name appears as 'Prote\'ina-Complexa'; confirm correct spelling and formatting for 'Proteína-Complexa'.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and verifiability.
- Referee: [Abstract] Abstract: The central claims of outperformance on the AME benchmark (success rate, novelty, diversity, validity) and reduced training cost are stated without any accompanying quantitative results, tables of metrics, error bars, statistical tests, ablation studies, or details on data splits and evaluation protocol. This absence is load-bearing for the primary empirical claim and prevents verification of whether the results support the stated superiority.
  Authors: We agree that the abstract would be strengthened by including specific quantitative results. The full manuscript already contains detailed tables with success rates (under the strict global fold + catalytic geometry criteria), novelty, diversity, and validity metrics, along with error bars, statistical comparisons, data splits, and the full evaluation protocol in the results and methods sections. To make the primary claims immediately verifiable from the abstract, we will revise it to report key numbers such as the success rates for Emyx versus the baselines and the precise training cost comparison.
  Revision: yes
- Referee: [Abstract] Abstract: The reparametrization is described as 'exact' and bridging flow matching with EDM sampling methods, yet no derivation, equations, or proof of exactness is supplied in the provided text, leaving the technical foundation for the claimed efficiency gains unverified.
  Authors: The manuscript derives the exact reparametrization in the methods section, showing the mathematical equivalence of the flow-matching interpolant to the EDM noise-level framework. However, we acknowledge that a self-contained derivation with equations and a brief proof of exactness would improve accessibility. We will add this derivation (including the key equations) to the main text or a new subsection in the revision.
  Revision: yes
Circularity Check
No significant circularity identified
The paper's central claims rest on an empirical benchmark comparison and a stated exact reparametrization of the flow-matching interpolant to the EDM framework. No load-bearing derivation reduces by construction to fitted inputs, self-citations, or ansatzes imported from prior author work. The reparametrization is presented as mathematically exact rather than a fitted or renamed result, and performance metrics are reported as direct experimental outcomes on the AME benchmark, not quantities forced by internal parameter choices. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Conditional flow matching on sparse geometric constraints is sufficient for all-atom protein generation without rich co-evolutionary signals.
Reference graph
Works this paper leans on
[1] Frances H. Arnold. Design by directed evolution. Accounts of Chemical Research, 31(3):125–131, 1998. doi: 10.1021/ar960017f
[2] Dina Listov and Sarel J. Fleishman. De novo enzyme design: Controlling structure to design function. Current Opinion in Structural Biology, 98:103252, 2026. ISSN 0959-440X. doi: 10.1016/j.sbi.2026.103252
[3] Andy Hsien-Wei Yeh, Christoffer Norn, Yakov Kipnis, Doug Tischer, Samuel J. Pellock, Declan Evans, Pengchen Ma, Gyu Rie Lee, Jason Z. Zhang, Ivan Anishchenko, Brian Coventry, Longxing Cao, Justas Dauparas, Samer Halabiya, Michelle DeWitt, Lauren Carter, K. N. Houk, and David Baker. De novo design of luciferases using deep learning. Nature, 614(7949):774–78...
[4] Anna Lauko, Samuel J Pellock, Kiera H Sumida, Ivan Anishchenko, David Juergens, Woody Ahern, Jihun Jeung, Alexander F Shida, Andrew Hunt, Indrek Kalvet, et al. Computational design of serine hydrolases. Science, 388(6744):eadu2454, 2025
[5] Donghyo Kim, Seth M Woodbury, Woody Ahern, Doug Tischer, Alex Kang, Emily Joyce, Asim K Bera, Nikita Hanikel, Saman Salike, Rohith Krishna, et al. Computational design of metallohydrolases. Nature, 649(8095):246–253, 2026
[6] Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 620:1089–1100, 2023
[7] Rohith Krishna, Jue Wang, Woody Ahern, Pascal Sturmfels, Preetham Venkatesh, Indrek Kalvet, Gyu Rie Lee, Felix S Morey-Burrows, Ivan Anishchenko, Ian R Humphreys, Ryan McHugh, Dionne Vafeados, Xinting Li, George A Sutherland, Andrew Hitchcock, C Neil Hunter, Alex Kang, Evans Brackenbrough, Asim K Bera, Minkyung Baek, Frank DiMaio, and David Baker. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science, 384(6693):eadl2528, April 2024
[8] Woody Ahern, Jason Yim, Doug Tischer, Saman Salike, Seth M Woodbury, Donghyo Kim, Indrek Kalvet, Yakov Kipnis, Brian Coventry, Han Raut Altae-Tran, Magnus S Bauer, Regina Barzilay, Tommi S Jaakkola, Rohith Krishna, and David Baker. Atom-level enzyme active site scaffolding using RFdiffusion2. Nat. Methods, 23(1):96–105, January 2026
[9] Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, and Karsten Kreis. Scaling atomistic protein binder design with generative pretraining and test-time compute. In The Fourteenth International Conference ...
[10] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Alexis Courbet, Rob J. de Haas, Neville Bethel, Philip J. Y. Leung, Timothy F. Huddy, Samuel Pellock, Doug Tischer, Frank Chan, Brian Koepnick, Huong Nguyen, Alex Kang, Banumathi Sankaran, Asim K. Bera, Neil P. King, and David Baker. Robu...
[11] Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, and David Baker. Atomic context-conditioned protein sequence design using LigandMPNN. Nature Methods, 22:717–723,
[12] doi: 10.1038/s41592-025-02626-1.
[13] Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, and Arash Vahdat. La-Proteina: Atomistic protein generation via partially latent flow matching. arXiv preprint arXiv:2507.09466, 2025
[14] Jasper Butcher, Rohith Krishna, Raktim Mitra, Rafael I Brent, Yanjing Li, Nathaniel Corley, Paul T Kim, Jonathan Funk, Simon Mathis, Saman Salike, et al. De novo design of all-atom biomolecular interactions with RFdiffusion3. bioRxiv, 2025. doi: 10.1101/2025.09.18.676967
[15] Hannes Stark, Felix Faltings, MinGyu Choi, Yuxin Xie, Eunsu Hur, Timothy O’Donnell, Anton Bushuiev, Talip Ucar, Saro Passaro, Weian Mao, Mateo Reveiz, Roman Bushuiev, et al. BoltzGen: Toward universal binder design. bioRxiv, 2025. doi: 10.1101/2025.11.20.689494
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021
[17] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 2021
[18] Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Noah Getz, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Liam Atkinson, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, and Regina Barzilay. Boltz-1 democratizing biomolecular interaction modeling. May 2025
[19] Chai Discovery, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. bioRxiv, 2024. doi: 10.1101/2024.10.10.615955. URL https://www.biorxiv.org/content/10.1101/2024.10.10.615955v1
[20] Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J. Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016):493–500, 2024. doi: 10.1038/s41586-024-07487-w
[21] Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, and Regina Barzilay. Boltz-2: Towards accurate and efficient binding affinity prediction. bioRxiv, 2025. doi: 10.1101/2025.06.14.659707
[22] Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, and Karsten Kreis. Proteina: Scaling flow-based protein structure generative models. In International Conference on Learning Representations, 2025
[23] Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Joshua M. Susskind, and Miguel Ángel Bautista. Simplefold: Folding proteins is simpler than you think. In The Fourteenth International Conference on Learning Representations,
[24] URL https://openreview.net/forum?id=0j0MmK7EMA
[25] William Peebles and Saining Xie. Scalable diffusion models with transformers. In International Conference on Computer Vision, 2023
[26] Roshan M. Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. MSA transformer. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8844–8856. PMLR, 2021
[27] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. doi: ...
[28] Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul S. Molina, Neil Thomas, Yousuf A. Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Matthew Nemeth, Patrick D. Hsu, Tom Sercu, Salvatore Cand...
[29] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022.
[30] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations, 2023
[31] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023
[32] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, 2024
[33] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020
[34] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science, pages 646–661. Springer International Publishing, Cham, 2016
[35] Valerio Mariani, Marco Biasini, Alessandro Barbato, Florian Kiefer, and Torsten Schwede. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21):2722–2728, 2013
[36] Michel van Kempen, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron L.M. Gilchrist, Johannes Söding, and Martin Steinegger. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, 42:243–246, 2024
[37] Matthew Z Tien, Dariya K Sydykova, Austin G Meyer, and Claus O Wilke. PeptideBuilder: A simple python library to generate model peptides. PeerJ, 1:e80, May 2013
[38] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Foundations of Crystallography, 32(5):922–923, 1976
[39] Robbie P. Joosten, Fei Long, Garib N. Murshudov, and Anastassis Perrakis. The PDB_REDO server for macromolecular structure model optimization. IUCrJ, 1(4):213–220, 2014
[40] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000
[41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019
[42] Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209):1–80, 2025
[43] Saurabh Singh and Ian Fischer. Stochastic sampling from deterministic flow models. arXiv preprint arXiv:2410.02217, 2024
[44] Yang Zhang and Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004
[45] Charles H. Martin, Tongsu Peng, and Michael W. Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications, 12:4639, 2021.
Supplementary Information
A Protein representation
A.1 Rep14 tokenisation ...
[46] Chemical bonds: all atom-atom and token-token bonds from the structure. Since each ligand atom is its own token, ligand atom-atom bonds directly become token-token edges at the token level, preserving ligand connectivity in both graphs.
[47] Ligand k-NN: the k_lig nearest HETATM neighbours of each HETATM atom, ensuring ligand-internal connectivity is maintained throughout the generation.
[48] Motif connectivity (token level only): edges from every non-motif token to every motif token. This connectivity ensures that the scaffold always has direct access to the motif geometry.
5. k-NN fill: the remaining budget per node (atom or token) is filled by nearest neighbours in Euclidean distance.
B Training details
B.1 Base distribution
The base distrib...
[49] Spatial cropping: Structures are cropped to a maximum token budget. A seed atom (from the HETATMs if available) is selected, and whole residues or whole ligands are greedily added by increasing distance from the seed. Unlike other models which crop based on spatial and sequence distance with equal probability, we only crop by spatial distance. We suspect ...
[50] Motif sampling: With some probability, between 2 and 20 residues are selected as a frozen motif. With equal probability, motif residues are sampled as contiguous 3-residue segments along the chain or as random individual residues. For each selected residue, the atoms included in the motif are determined by one of two modes: (a) all mode, which includes all ...
[51] HETATM freezing: With some probability, we freeze HETATM atoms (ligands, cofactors, metals) alongside the motif residues as part of the motif condition.
[52] Conditional feature masking: Features are stochastically masked during training to allow for both unconditional and conditional inference. For flowing (non-motif) atoms, all features apart from the atom index in the Rep14 representation (§A.1) are always masked. For motif residues, features are only masked half of the time. Global features are also each i...
[53] SE(3) augmentation: All structures are first centred at the origin and then randomly rotated and translated so that the model learns 3D equivariance.
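The augmentation step described here can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the QR-based rotation sampler and the `translation_scale` parameter are all assumptions, since the source does not say how rotations or translations are drawn.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper rotation matrix (assumed sampler: QR of a Gaussian matrix)."""
    a = rng.normal(size=(3, 3))
    q, r = np.linalg.qr(a)
    # Fix the column signs so the orthogonal matrix is uniformly distributed.
    q = q * np.sign(np.diag(r))
    # Flip one axis if needed so that det = +1 (rotation, not reflection).
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def se3_augment(coords, rng, translation_scale=1.0):
    """Centre an (N, 3) coordinate array, then rotate and translate it randomly.

    translation_scale is a hypothetical parameter; the source gives no value.
    """
    centred = coords - coords.mean(axis=0, keepdims=True)
    rot = random_rotation(rng)
    shift = rng.normal(scale=translation_scale, size=(1, 3))
    return centred @ rot.T + shift

rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3)) * 5.0
augmented = se3_augment(coords, rng)
```

Because the transform is rigid, all pairwise distances in `augmented` match those in `coords`; only the global pose changes.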
[54] Coordinate scaling: The noise and data coordinates are scaled down by σ_data = 10 before being passed to the model for numerical and training stability.
B.5 Dataset preparation
Sources and filtering. Starting protein structures were obtained from (i) PDB-REDO [37], which provides re-refined and rebuilt X-ray crystallographic models, and (ii) the RCSB Protein ...
[55] Backbone generation. For each motif target, Emyx generates 200 backbone structures using the EDM sampler (Algorithm 1) with 200 integration steps.
[56] Inverse folding. LigandMPNN [11] redesigns the sequence for each generated backbone, producing 8 candidate sequences per structure at a sampling temperature of 0.1. The ligand and motif residue identities are held fixed; only scaffold positions are redesigned.
9 Active sites are derived from the Mechanism and Catalytic Site Atlas (M-CSA) cross-referenced wit...
[57] Structure prediction. Each candidate sequence is folded 5 times by Boltz-2 [20] with MSA input enabled. The prediction with the highest pTM score is retained as the representative fold for that sequence.
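The per-target sampling budget implied by these evaluation steps, together with the best-of-five selection, can be sketched as follows. The counts (200 backbones, 8 sequences per backbone, 5 folds per sequence) come from the text; the `Prediction` class and `pick_representative` helper are hypothetical names, and the pTM values are made up for illustration.

```python
from dataclasses import dataclass

# Counts stated in the text for one motif target.
N_BACKBONES = 200
N_SEQS_PER_BACKBONE = 8
N_FOLDS_PER_SEQ = 5

@dataclass
class Prediction:
    """Hypothetical container for one structure prediction."""
    fold_id: int
    ptm: float

def pick_representative(predictions):
    """Return the prediction with the highest pTM score."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return max(predictions, key=lambda p: p.ptm)

n_sequences = N_BACKBONES * N_SEQS_PER_BACKBONE
n_fold_runs = n_sequences * N_FOLDS_PER_SEQ

# Made-up pTM values for the five folds of a single sequence.
folds = [Prediction(i, s) for i, s in enumerate([0.71, 0.83, 0.79, 0.64, 0.80])]
best = pick_representative(folds)
print(n_sequences, n_fold_runs, best.fold_id)  # 1600 8000 1
```

On these numbers each motif target costs 1,600 designed sequences and 8,000 structure predictions, so the folding step is likely to dominate the evaluation budget.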
[58] Success criteria. As discussed in §3.1, we evaluate three approaches to the success criteria that differ in (i) how the designed and re-predicted structures are aligned and (ii) which RMSD values are checked. All three additionally require no ligand clash: every inter-atomic distance between HETATM (ligand) atoms and protein atoms in the generated structure exceeds 1.5 Å ...
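The no-clash requirement stated in the success criteria is simple enough to sketch directly. The version below is a minimal NumPy illustration that assumes coordinates in Å as (N, 3) arrays; the function name and the treatment of an empty ligand as trivially clash-free are assumptions, not details from the paper.

```python
import numpy as np

def has_no_ligand_clash(protein_xyz, ligand_xyz, min_dist=1.5):
    """Return True if every protein-ligand atom distance exceeds min_dist (in Å)."""
    protein_xyz = np.asarray(protein_xyz, dtype=float).reshape(-1, 3)
    ligand_xyz = np.asarray(ligand_xyz, dtype=float).reshape(-1, 3)
    # Assumption: with no ligand (or no protein) atoms there is nothing to clash.
    if protein_xyz.size == 0 or ligand_xyz.size == 0:
        return True
    # Pairwise distance matrix of shape (n_protein, n_ligand).
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    return bool(np.all(dists > min_dist))

protein = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]])
far_ligand = np.array([[0.0, 4.0, 0.0]])
near_ligand = np.array([[0.0, 1.0, 0.0]])
print(has_no_ligand_clash(protein, far_ligand))   # True
print(has_no_ligand_clash(protein, near_ligand))  # False
```

The dense distance matrix is fine for a single design; for very large structures a spatial index such as a k-d tree would avoid the quadratic memory cost.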