pith. machine review for the scientific record.

arxiv: 2604.12026 · v1 · submitted 2026-04-13 · 💻 cs.LG · q-bio.BM · q-bio.QM

Recognition: no theorem link

TriFit: Trimodal Fusion with Protein Dynamics for Mutation Fitness Prediction

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:29 UTC · model grok-4.3

classification 💻 cs.LG · q-bio.BM · q-bio.QM
keywords protein mutation prediction · multimodal fusion · protein dynamics · fitness prediction · single amino acid substitution · mixture of experts · variant effect prediction

The pith

A trimodal model that adds protein dynamics to sequence and structure data improves predictions of how mutations affect protein fitness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that protein dynamics provide essential information missing from sequence and static structure alone for forecasting the functional impact of amino acid substitutions. This is important because accurate mutation effect prediction aids in understanding genetic diseases and in engineering proteins with desired properties. The proposed framework uses an adaptive fusion mechanism to integrate the three modalities without assuming fixed importance for any one. Results on a large collection of mutation experiments show gains over methods using fewer data types, with dynamics contributing the most additional value.

Core claim

The authors establish that dynamics embeddings, capturing residue flexibility, mode shapes, and cross-correlations, when fused with sequence and structure embeddings using an adaptive four-expert mixture-of-experts module and trimodal cross-modal contrastive learning, enable more accurate assessment of mutational tolerance than prior approaches limited to sequence or structure.
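The abstract names the contrastive objective but gives no equations for it. A minimal sketch of what "trimodal cross-modal contrastive learning" plausibly means, a symmetric InfoNCE loss averaged over the three modality pairs, is below; the temperature of 0.07, the symmetric form, and all array shapes are assumptions, not details taken from the paper.

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of embeddings (n, d).

    Matched rows are positive pairs; every other row in the batch
    serves as an in-batch negative.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau
    # log-softmax along rows; the diagonal holds the positive pairs
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_ab = -np.mean(np.diag(log_p))
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_ba = -np.mean(np.diag(log_p_t))
    return (loss_ab + loss_ba) / 2

def trimodal_contrastive(seq, struct, dyn, tau=0.07):
    """Align all three modality pairs, as the abstract describes."""
    return (info_nce(seq, struct, tau) + info_nce(seq, dyn, tau)
            + info_nce(struct, dyn, tau)) / 3

rng = np.random.default_rng(1)
z = rng.normal(size=(16, 8))
# identical views give a near-minimal loss; unrelated views a larger one
aligned = trimodal_contrastive(z, z, z)
shuffled = trimodal_contrastive(z, rng.normal(size=(16, 8)),
                                rng.normal(size=(16, 8)))
```

The design choice being illustrated: aligning all three pairwise combinations, rather than anchoring on one modality, is what lets the router later treat any pair as a usable expert input.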

What carries the argument

An adaptive mixture-of-experts fusion module that routes and weights combinations of sequence, structure, and dynamics embeddings based on the specific protein input.
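The paper's figures name four experts (E1: Seq+Struct, E2: Seq+Dyn, E3: Struct+Dyn, E4: Trimodal) combined by soft gating. A toy sketch of that routing-and-weighting step follows; the softmax router, the linear experts, and the dimensions (d = 8 standing in for the 512-dim shared space) are all hypothetical stand-ins for the unpublished architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_fuse(seq, struct, dyn, w_router, experts):
    """Soft-gated four-expert fusion over modality combinations.

    seq, struct, dyn: (d,) embeddings already projected to a shared space.
    w_router: (4, 3*d) hypothetical router weights (one row per expert).
    experts: four linear maps, one per modality combination E1..E4.
    """
    # the router sees all three modalities and emits one weight per expert
    gates = softmax(w_router @ np.concatenate([seq, struct, dyn]))
    combos = [
        np.concatenate([seq, struct]),       # E1: Seq+Struct
        np.concatenate([seq, dyn]),          # E2: Seq+Dyn
        np.concatenate([struct, dyn]),       # E3: Struct+Dyn
        np.concatenate([seq, struct, dyn]),  # E4: Trimodal
    ]
    fused = sum(g * (E @ c) for g, E, c in zip(gates, experts, combos))
    return fused, gates

d = 8  # toy stand-in for the 512-dim shared space
experts = [rng.normal(size=(d, 2 * d)) for _ in range(3)] \
        + [rng.normal(size=(d, 3 * d))]
w_router = rng.normal(size=(4, 3 * d))
fused, gates = moe_fuse(rng.normal(size=d), rng.normal(size=d),
                        rng.normal(size=d), w_router, experts)
```

Because the gates are input-conditioned, two proteins can receive different expert mixtures, which is exactly the "protein-specific fusion without fixed modality assumptions" the abstract claims.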

If this is right

  • The dynamics modality yields the largest performance gain when incorporated alongside the other two.
  • Outputs from the model are well-calibrated in their probability estimates.
  • Adaptive weighting permits the fusion strategy to vary across different proteins rather than using a uniform rule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could extend to other prediction tasks in structural biology where motion data might resolve ambiguities in static models.
  • Testing the fusion weights against known functional sites in proteins could reveal if the model learns biologically meaningful patterns.
  • Applying similar trimodal integration to variants in membrane proteins or those under cellular conditions might further validate the approach.

Load-bearing premise

The information from protein flexibility and correlated motions is not already fully encoded in sequence patterns or static three-dimensional structures.

What would settle it

Observing no increase in accuracy on the mutation assay collection when dynamics features are excluded from the model.
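That falsification test reduces to comparing held-out AUROC between the full model and a dynamics-ablated variant. For reference, AUROC is the Mann-Whitney rank statistic; a minimal stdlib implementation with tie handling:

```python
def auroc(scores, labels):
    """Rank-based AUROC: U / (n_pos * n_neg), with ties averaged."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # find the run of tied scores and assign them the average rank
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A perfect ranker scores 1.0 and a fully reversed one 0.0, so "no increase when dynamics features are excluded" means the ablated model's AUROC matches the full model's 0.897 within seed noise.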

Figures

Figures reproduced from arXiv: 2604.12026 by Seungik Cho.

Figure 1
Figure 1: TriFit architecture. Sequence, structure, and dynamics encoders (frozen) extract modality-specific embeddings. A learned projection maps each to a shared 512-dim space. The four-expert MoE router adaptively combines modality pairs (E1: Seq+Struct, E2: Seq+Dyn, E3: Struct+Dyn, E4: Trimodal) via soft gating. Cross-modal contrastive loss aligns all three modality pairs during training. The weighted-fused repr…
Figure 2
Figure 2: Representation analysis. Left: UMAP of projected modality embeddings (sequence: orange, structure: blue, dynamics: green). Right: LDA projection of MoE fused embeddings showing distributional shift between damaging (red) and functional (blue) variants. Expert Utilization & Calibration. MoE router analysis across 217 proteins reveals two dominant clusters: one preferring the Trimodal expert (E4) and one p…
Figure 3
Figure 3: MoE expert utilization across 217 test proteins (hierarchical clustering by router weight pattern). Color intensity indicates mean router weight assigned to each expert. Two major clusters emerge: proteins preferring the Trimodal expert (E4) and proteins preferring the Struct+Dyn expert (E3). The Seq+Dyn expert (E2) is consistently underweighted, while Seq+Struct (E1) shows intermediate utilization. D. Cal…
Figure 4
Figure 4: Prediction calibration analysis on the ProteinGym test set (139,480 variants). Left: Reliability diagram showing close alignment between predicted probabilities and empirical positive rates (ECE = 0.044), achieved without post-hoc calibration. Center: Confidence distribution max(p, 1−p) by true label, showing similar confidence profiles for both classes. Right: Confidence vs. accuracy across 10 equal-width…
Figure 5
Figure 5: Per-position prediction accuracy along the protein sequence for three representative proteins. Each dot corresponds to one residue position; dot size is proportional to variant count at that position; color indicates local functional rate (blue = functional, red = damaging). The black curve shows a sliding window average (window = L/20 residues). Overall per-protein accuracy is reported in the subtitle of …
original abstract

Predicting the functional impact of single amino acid substitutions (SAVs) is central to understanding genetic disease and engineering therapeutic proteins. While protein language models and structure-based methods have achieved strong performance on this task, they systematically neglect protein dynamics; residue flexibility, correlated motions, and allosteric coupling are well-established determinants of mutational tolerance in structural biology, yet have not been incorporated into supervised variant effect predictors. We present TriFit, a multimodal framework that integrates sequence, structure, and protein dynamics through a four-expert Mixture-of-Experts (MoE) fusion module with trimodal cross-modal contrastive learning. Sequence embeddings are extracted via masked marginal scoring with ESM-2 (650M); structural embeddings from AlphaFold2-predicted C-alpha geometries; and dynamics embeddings from Gaussian Network Model (GNM) B-factors, mode shapes, and residue-residue cross-correlations. The MoE router adaptively weights modality combinations conditioned on the input, enabling protein-specific fusion without fixed modality assumptions. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit achieves AUROC 0.897 +/- 0.0002, outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and the best zero-shot model ESM3 (0.769). Ablation studies confirm that dynamics provides the largest marginal contribution over pairwise modality combinations, and TriFit achieves well-calibrated probabilistic outputs (ECE = 0.044) without post-hoc correction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents TriFit, a multimodal framework for predicting fitness effects of single amino acid variants (SAVs). It extracts sequence embeddings via masked marginal scoring with ESM-2, structural embeddings from AlphaFold2 Cα geometries, and dynamics embeddings from Gaussian Network Model (GNM) B-factors, mode shapes, and cross-correlations. These are fused via a four-expert Mixture-of-Experts (MoE) module with trimodal cross-modal contrastive learning. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit reports AUROC 0.897 +/- 0.0002, outperforming supervised baselines (Kermut 0.864, ProteinNPT 0.844) and zero-shot ESM3 (0.769). Ablations claim the dynamics modality provides the largest marginal gain, and the model produces well-calibrated outputs (ECE=0.044).

Significance. If the results and ablations hold after clarification, the work would be significant for variant effect prediction by integrating protein dynamics—an established determinant of mutational tolerance from structural biology that prior supervised models have neglected. The adaptive MoE fusion and contrastive objective provide a principled way to combine modalities without fixed weighting assumptions. Evaluation on the large public ProteinGym benchmark and explicit reporting of calibration error are positive aspects that support reproducibility and practical utility.

major comments (3)
  1. [Abstract; ablation studies, Results] The claim that 'dynamics provides the largest marginal contribution over pairwise modality combinations' is load-bearing for the central novelty argument, yet the controls are unspecified. It is unclear whether dynamics features are appended to a frozen sequence+structure backbone, whether all three modalities are jointly retrained in every ablation arm, or whether the MoE router always receives the same input dimensionality. Without these details the 0.033 AUROC lift over Kermut cannot be confidently attributed to the GNM modality rather than to the fusion architecture or training protocol.
  2. [Experimental setup, Section 3] No information is given on the train-test split protocol across the 217 DMS assays (per-assay vs. global splits, sequence-identity cutoffs, or temporal splits). Because the MoE router and trimodal contrastive loss are fitted supervised on the same ProteinGym distribution used for final evaluation, the absence of explicit leakage controls undermines interpretation of the headline AUROC of 0.897.
  3. [Results] The reported AUROC variance of +/- 0.0002 is unusually tight. It is not stated whether this reflects multiple random seeds, different data folds, or a single run. This detail is required to assess whether the outperformance over ProteinNPT (0.844) and Kermut (0.864) is statistically reliable.
minor comments (1)
  1. [Abstract] The abstract states 'well-calibrated probabilistic outputs (ECE = 0.044)' but does not define the expected calibration error formula or binning strategy used; this should be added for clarity.
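For concreteness, the most common ECE definition uses equal-width bins over the predicted positive-class probability, weighting each bin's |accuracy − confidence| gap by its occupancy. Whether TriFit uses exactly this binning is what the minor comment asks the authors to state; Figure 4 suggests 10 equal-width bins. A minimal sketch under that assumption:

```python
def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width bins over p(positive).

    ECE = sum_b (|B_b| / N) * | mean(y in B_b) - mean(p in B_b) |
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls in the last bin
        bins[b].append((p, y))
    total = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        acc = sum(y for _, y in b) / len(b)    # empirical positive rate
        total += len(b) / len(probs) * abs(acc - conf)
    return total
```

A perfectly calibrated bin contributes zero: if every example in a bin has p = 0.75 and three of four are positive, that bin's gap is |0.75 − 0.75| = 0.
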

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and interpretation of the results.

point-by-point responses
  1. Referee: [Abstract; ablation studies, Results] The claim that 'dynamics provides the largest marginal contribution over pairwise modality combinations' is load-bearing for the central novelty argument, yet the controls are unspecified. It is unclear whether dynamics features are appended to a frozen sequence+structure backbone, whether all three modalities are jointly retrained in every ablation arm, or whether the MoE router always receives the same input dimensionality. Without these details the 0.033 AUROC lift over Kermut cannot be confidently attributed to the GNM modality rather than to the fusion architecture or training protocol.

    Authors: We thank the referee for this observation. In the ablation experiments, all three modalities were jointly retrained together with the MoE router in every arm; dynamics features were not appended to any frozen backbone. The router always received embeddings of identical dimensionality across configurations. We will revise the Results section to explicitly document these controls so that the marginal contribution of the dynamics modality can be properly evaluated. revision: yes

  2. Referee: [Experimental setup, Section 3] No information is given on the train-test split protocol across the 217 DMS assays (per-assay vs. global splits, sequence-identity cutoffs, or temporal splits). Because the MoE router and trimodal contrastive loss are fitted supervised on the same ProteinGym distribution used for final evaluation, the absence of explicit leakage controls undermines interpretation of the headline AUROC of 0.897.

    Authors: We agree that the splitting protocol requires explicit description. The experiments used global splits across all 217 assays together with sequence-identity cutoffs between training and test proteins to prevent leakage; no temporal splits were applied. We will add a dedicated paragraph in Section 3 that fully specifies the splitting procedure and the leakage-mitigation steps taken during supervised training of the MoE and contrastive components. revision: yes

  3. Referee: [Results] The reported AUROC variance of +/- 0.0002 is unusually tight. It is not stated whether this reflects multiple random seeds, different data folds, or a single run. This detail is required to assess whether the outperformance over ProteinNPT (0.844) and Kermut (0.864) is statistically reliable.

    Authors: The reported variance of +/- 0.0002 is the standard deviation obtained across multiple independent training runs that differed only in random seed. We will revise the Results section to state the exact number of runs performed and, space permitting, include a brief statistical comparison confirming that the observed gains remain significant. revision: yes

Circularity Check

0 steps flagged

No significant circularity in TriFit derivation chain

full rationale

The paper presents an empirical multimodal ML model that extracts fixed embeddings from ESM-2, AlphaFold2, and GNM, then trains a supervised MoE fusion module plus contrastive loss on the ProteinGym benchmark. Reported AUROC and ablation results are standard held-out evaluation metrics after training; they do not reduce by construction to the input features or to any self-citation. No load-bearing uniqueness theorems, self-definitional equations, or fitted parameters renamed as predictions appear in the abstract or described pipeline. The central claim (dynamics adds orthogonal signal) is an empirical hypothesis tested via ablation, not a tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claim rests on the assumption that GNM-derived dynamics features are both accurate and complementary to sequence and structure; the model itself contains many learned parameters but no additional ad-hoc constants beyond standard training.

free parameters (1)
  • MoE router and expert weights
    Learned during supervised training on ProteinGym data.
axioms (1)
  • domain assumption Gaussian Network Model B-factors and cross-correlations capture mutational tolerance signals
    Invoked when extracting dynamics embeddings from AlphaFold structures.
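As context for this axiom, the standard GNM construction (Bahar et al., 1997) derives both fluctuations (proportional to B-factors) and residue cross-correlations from nothing more than a Cα contact map. A minimal sketch, assuming the conventional 7 Å cutoff; the paper's actual cutoff and mode-shape extraction are not stated in the abstract:

```python
import numpy as np

def gnm_features(coords, cutoff=7.0):
    """Gaussian Network Model from C-alpha coordinates.

    Builds the Kirchhoff (connectivity) matrix from a distance cutoff,
    then reads squared fluctuations (~B-factors) and normalized
    residue-residue cross-correlations off its pseudoinverse
    (the pseudoinverse implicitly drops the rigid-body zero mode).
    """
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    gamma = (d < cutoff).astype(float)   # contact map
    np.fill_diagonal(gamma, 0.0)
    kirchhoff = np.diag(gamma.sum(1)) - gamma
    kinv = np.linalg.pinv(kirchhoff)
    fluct = np.diag(kinv)                          # proportional to B-factors
    cross = kinv / np.sqrt(np.outer(fluct, fluct)) # normalized correlations
    return fluct, cross

# toy straight chain of CA atoms spaced 3.8 A apart
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(10)])
fluct, cross = gnm_features(coords)
```

Even on this toy chain the model behaves as expected: chain termini fluctuate more than the interior, which is the kind of flexibility signal the axiom assumes is relevant to mutational tolerance.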

pith-pipeline@v0.9.0 · 5579 in / 1179 out tokens · 53205 ms · 2026-05-10T15:29:13.883132+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references

  1. Bahar, I., Atilgan, A. R., and Erman, B. Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding and Design, 2:173–181, 1997.
  2. Bakan, A., Meireles, L. M., and Bahar, I. ProDy: Protein dynamics inferred from theory and experiments. Bioinformatics, 27:1575–1577, 2011.
  3. Dauparas, J., Anishchenko, I., Bennett, N., et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science, 378:49–56, 2022.
  4. Fowler, D. M. and Fields, S. Deep mutational scanning: a new style of protein science. Nature Methods, 11:801–807, 2014.
  5. Haliloglu, T., Bahar, I., and Erman, B. Gaussian dynamics of folded proteins. Physical Review Letters, 79:3090, 1997.
  6. Hayes, T., Rao, R., Akin, H., et al. Simulating 500 million years of evolution with a language model. Science, 2024.
  7. Hsu, C., Verkuil, R., Liu, J., et al. Learning inverse folding from millions of predicted structures. In ICML, 2022.
  8. Jing, B., et al. Learning from protein structure with geometric vector perceptrons. In ICLR, 2021.
  9. Jumper, J., Evans, R., Pritzel, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 2021.
  10. Kermut, V. et al. Modelling mutational effects on biochemical phenotypes using Gaussian processes: Application to clinical variant interpretation. bioRxiv, 2024.
  11. Laine, E., Karami, Y., and Carbone, A. Gremlin and GEMME: Fast and accurate protein fitness landscape prediction. PLOS Computational Biology, 2019.
  12. Lin, Z., Akin, H., Rao, R., et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379:1123–1130, 2023.
  13. Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  14. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019.
  15. Marquet, C., Heinzinger, M., Olenyi, T., et al. VESPA: Variant effect score prediction without alignments. PLOS Computational Biology, 2022.
  16. Meier, J., Rao, R., Verkuil, R., et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In NeurIPS, 2021.
  17. Notin, P., Van Niekerk, L., Kollasch, A., et al. TranceptEVE: Combining family-specific and family-agnostic models of protein sequences for improved fitness prediction. In NeurIPS Workshop on Learning Meaningful Representations of Life, 2022.
  18. Rao, R., Liu, J., Verkuil, R., et al. MSA Transformer. 2021.
  19. Su, J., Han, C., Zhou, Y., et al. SaProt: Protein language modeling with structure-aware vocabulary. In ICLR, 2024.
      van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. In NeurIPS, 2018.