Expanding functional protein sequence space using high entropy generative models
3 Pith papers cite this work.
abstract
Boltzmann Machines trained on evolutionary sequence data have emerged as a powerful paradigm for the data-driven design of artificial proteins. However, the relationship between model architecture, specifically parameter density, and experimental performance remains poorly understood. Here, we investigate this relationship using the Chorismate Mutase enzyme family as a model system. We compare standard fully connected Boltzmann Machines for Direct Coupling Analysis (bmDCA) with sparse models generated via progressive edge activation (eaDCA) and edge decimation (edDCA). We identify a maximum-entropy model (meDCA) along the decimation trajectory that represents an optimal balance between constraint satisfaction and the flexibility of the probability distribution. We synthesized and tested artificial sequences from all models using an in vivo complementation assay, finding that all architectures, regardless of sparsity, generate functional enzymes with high success rates, even at significant divergence from natural sequences. Despite this functional equivalence, we demonstrate that the meDCA model samples a viable sequence space that is more than fifteen orders of magnitude larger than its low-entropy counterparts. Furthermore, comparative analyses reveal that high-entropy models systematically minimize overfitting and better capture the local neutral spaces surrounding natural proteins. These findings suggest that while various models satisfying coevolutionary statistics can generate functional sequences, high-entropy Boltzmann Machines provide a superior representation of the underlying evolutionary fitness landscape.
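As a rough illustration of the quantities the abstract refers to, the sketch below assumes the generic Potts-model form commonly used in DCA, in which a sequence's energy is the negative sum of per-site fields and pairwise couplings over a set of active edges. A sparse model is then one with few active edges. It also assumes that the size of the sampled sequence space scales as exp(S) for entropy S in nats, so that "fifteen orders of magnitude" corresponds to an entropy gap of about 15 ln 10 ≈ 34.5 nats. The function names, the toy parameters, and the dictionary-of-edges layout are hypothetical and are not taken from the paper.

```python
import math

def potts_energy(seq, fields, couplings):
    """Energy of a sequence under a generic Potts model (illustrative form).

    seq       : list of integer-encoded residues, one per site
    fields    : fields[i][a] is the local field for residue a at site i
    couplings : dict mapping an active edge (i, j) to a matrix J,
                with J[a][b] the coupling between residue a at site i
                and residue b at site j; absent edges contribute nothing
    """
    energy = -sum(fields[i][a] for i, a in enumerate(seq))
    for (i, j), J in couplings.items():
        energy -= J[seq[i]][seq[j]]
    return energy

def orders_of_magnitude(entropy_gap_nats):
    """Convert an entropy gap in nats to a base-10 ratio of space sizes,
    assuming the effective number of sequences scales as exp(S)."""
    return entropy_gap_nats / math.log(10)

# Toy model: 3 sites, 2 residue types, a single active edge (0, 1).
toy_fields = [[0.0, 0.5], [0.2, 0.0], [0.0, 0.1]]
toy_couplings = {(0, 1): [[0.3, 0.0], [0.0, 0.3]]}
toy_energy = potts_energy([1, 1, 0], toy_fields, toy_couplings)

# Entropy gap that corresponds to fifteen orders of magnitude.
gap_for_15_orders = 15 * math.log(10)
```

Storing couplings per edge keeps the same energy function usable for a fully connected model and for a sparse one; only the set of keys in the dictionary changes.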
citing papers explorer
- Towards coevolution-aware ancestral sequence reconstruction
  A new ASR framework integrates DCA-derived epistatic constraints with phylogenetic models, benchmarked via forward-evolution simulation on beta-lactamases and DNA-binding domains.
- Reconstructability of evolutionary intermediates in generative epistatic landscapes
  Generative epistatic landscapes show that evolutionary intermediate reconstruction works best via conditional sampling of ensembles rather than maximum-likelihood point predictions, with success limited by landscape topology and mutability.
- Modeling Protein Evolution with Generative Models: from Extant Sequence Data to Evolutionary Dynamics
  Review of generative sequence models and Direct Coupling Analysis for simulating protein evolutionary dynamics from extant data.