When Does Structure Help? The Information Bonus of AlphaFold2 Representations over Protein Language Models

Kargi Chauhan


AI scientist systems increasingly choose biological foundation models before they choose experiments. In protein pipelines, this creates a concrete engineering and scientific question: when is the cost of structural inference worth paying over a cheaper sequence-only model? We introduce the information bonus (IB), a task-level metric that measures the linearly accessible advantage of frozen single-sequence AlphaFold2 Evoformer representations over frozen ESM-2 embeddings under protein-level cross-validation. Across binding affinity regression (PDBbind, n=5,680), conformational flexibility (ATLAS molecular dynamics, 268 proteins), and allosteric-site classification (AlloSigDB, n=9,925 residues), IB is sharply mechanism-dependent. ESM-2 dominates binding affinity (IB=-0.141; Pearson r=0.449 vs. 0.307) and binary flexibility (IB=-0.060; AUROC 0.824 vs. 0.764; p=0.0017). AF2 single representations give the only above-chance allostery predictions (IB=+0.064; AUROC 0.548 vs. 0.485), revealing long-range geometric signal not recovered from sequence alone. We also identify a residue-level leakage artifact: naive residue splits inflate RMSF performance by 27-39% depending on the representation, enough to reverse representation rankings. These results turn representation selection into a measurable decision for AI-for-science systems.
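
The abstract does not state a formula for IB. The reported numbers are consistent with a simple difference of cross-validated scores, AF2 minus ESM-2 (for example, 0.764 - 0.824 = -0.060 for binary flexibility). The sketch below assumes that reading. It is illustrative only: the ridge probe, the Pearson metric, the fold count, and all function names (`group_folds`, `cv_score`, `information_bonus`) are assumptions and are not taken from the paper. Folds are built over protein identifiers, so no protein contributes residues to both the training and test sides, which is the leakage that a naive residue-level split would introduce.

```python
import numpy as np

def group_folds(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) such that no group (protein) spans both sides."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)  # randomize which proteins land in which fold
    for held_out in np.array_split(unique, n_splits):
        test_mask = np.isin(groups, held_out)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression as a linear probe (assumed probe type)."""
    mu, y_mu = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                        Xc.T @ (y_train - y_mu))
    return (X_test - mu) @ w + y_mu

def cv_score(X, y, groups, n_splits=5, alpha=1.0, seed=0):
    """Mean Pearson r of the probe over protein-level folds."""
    scores = []
    for tr, te in group_folds(groups, n_splits, seed):
        pred = ridge_fit_predict(X[tr], y[tr], X[te], alpha)
        scores.append(np.corrcoef(pred, y[te])[0, 1])
    return float(np.mean(scores))

def information_bonus(X_af2, X_esm, y, groups, **kw):
    """Assumed definition: IB = score(AF2 features) - score(ESM-2 features)."""
    return cv_score(X_af2, y, groups, **kw) - cv_score(X_esm, y, groups, **kw)
```

Under this convention, a negative IB means the sequence-only embedding is at least as linearly informative for the task, and a positive IB means the structural representation adds accessible signal. Because both calls share the same seed, the two representations are scored on identical folds.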
