Protein contacts are already in the attention: a single-forward-pass alternative to the Categorical Jacobian
Pith reviewed 2026-06-26 12:36 UTC · model grok-4.3
The pith
Protein contacts can be recovered from a small number of attention heads in a single forward pass, outperforming the multi-pass Categorical Jacobian on leakage-clean data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The signal that the Categorical Jacobian (CJ) reconstructs is already concentrated in a small subset of attention heads: averaging the top-K contact-relevant heads, selected on as few as 10 labeled proteins, recovers contacts in one forward pass and beats CJ on leakage-clean data for every bidirectional model where CJ is defined, and matches or beats it in-distribution (the exceptions being the smallest 8M model and a statistical tie on ESM Cambrian).
What carries the argument
Top-K contact-relevant attention heads, selected on labeled proteins, whose averaged attention maps yield contact predictions in a single forward pass.
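As a rough illustration of the readout described above, the sketch below ranks heads by top-L/2 long-range precision on a few labeled proteins and then averages the K best heads. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the (heads, L, L) attention layout, the symmetrization step, and the 24-residue long-range cutoff are illustrative choices that the source does not specify.

```python
import numpy as np

def symmetrize(a):
    """Average a square map with its transpose (assumed preprocessing step)."""
    return 0.5 * (a + a.T)

def long_range_precision(score, contacts, min_sep=24):
    """Fraction of true contacts among the top-L/2 scored pairs with j - i >= min_sep.
    min_sep=24 is a common convention, assumed here rather than taken from the paper."""
    L = score.shape[0]
    i, j = np.triu_indices(L, k=min_sep)
    top = np.argsort(-score[i, j])[: max(1, L // 2)]
    return float(contacts[i[top], j[top]].mean())

def select_top_k_heads(attn_list, contact_list, k):
    """Rank heads by summed precision over the labeled proteins; return the best k.
    attn_list holds one (n_heads, L, L) array per protein, with all layers
    flattened into the head axis (hypothetical layout)."""
    n_heads = attn_list[0].shape[0]
    totals = np.zeros(n_heads)
    for attn, contacts in zip(attn_list, contact_list):
        for h in range(n_heads):
            totals[h] += long_range_precision(symmetrize(attn[h]), contacts)
    return np.argsort(-totals)[:k]

def head_readout(attn, heads):
    """Unweighted mean of the selected heads: one forward pass, no fitted weights."""
    return symmetrize(attn[heads].mean(axis=0))
```

The only quantities learned from labels are the head indices and K, which matches the claim that the readout itself is parameter-free.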
If this is right
- The head readout beats CJ by 9 pp on ESM-2-650M in the leakage-clean CAMEO split.
- The gain localizes to labeled head selection, not averaging, since the unweighted mean ties a supervised logistic regression at matched label budget.
- Representation-CJ extends the Jacobian to architectures without a masked-LM head.
- Optimal K tracks how diffusely a model spreads its contact heads.
- Both methods lose the contact signal on causal LMs such as ProGen2.
Where Pith is reading between the lines
- Attention mechanisms in bidirectional models may inherently capture pairwise residue interactions during pretraining.
- Head selection could be applied to extract other structural signals from the same models without perturbation-based methods.
- The performance drop on leakage-clean data suggests that contact recovery numbers in prior work partly reflect pretraining overlap.
Load-bearing premise
The CAMEO split ensures that neither the head selection on 10 proteins nor the evaluation set touches sequences the models have plausibly memorized during pretraining.
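One way to make this premise checkable is to filter candidate proteins by their best hit against a database representative of the pretraining corpus. The sketch below assumes that a homology search has already been run and that its hits are available as (query, target, identity) tuples with identity expressed as a fraction; the function name, the tuple layout, and the 0.30 default (mirroring the "<30% identity" example in the author response) are assumptions for illustration.

```python
def leakage_clean_ids(query_ids, hits, max_identity=0.30):
    """Keep queries that have no hit at or above max_identity.

    hits: iterable of (query_id, target_id, identity_fraction) tuples,
    an assumed pre-parsed form of homology-search output.
    """
    flagged = {q for q, _target, ident in hits if ident >= max_identity}
    return [q for q in query_ids if q not in flagged]
```

A query with no reported hits is kept, so the search sensitivity settings matter as much as the threshold itself.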
What would settle it
If the averaged top-K heads fail to outperform the Categorical Jacobian on a new leakage-clean CAMEO-style split for bidirectional models, the single-forward-pass advantage would be falsified.
Original abstract
The Categorical Jacobian (CJ) of Zhang et al. (2024) reads protein contacts from a language model by perturbing every residue with every alternative amino acid, about 19L forward passes. We show the signal it reconstructs is already concentrated in a small subset of attention heads: averaging the top-K contact-relevant heads, selected on as few as 10 labeled proteins, recovers contacts in one forward pass and beats CJ on leakage-clean data for every bidirectional model where CJ is defined, and matches or beats it in-distribution (the exceptions being the smallest 8M model and a statistical tie on ESM Cambrian). Ablations localize the gain to labeled head selection, not averaging: at a matched label budget the unweighted mean ties a supervised L1 logistic regression on the same heads, so the parameter-free mean is selection's minimal form, not the source of the advantage. Our primary test is leakage-clean: on a CAMEO split where neither selection nor evaluation touches data the models have plausibly memorized, the head readout beats CJ on ESM-2-650M by +9 pp (N=29, p<0.001), with the within-model margin reproducing across architectures on a wider pretraining-aware set. Both methods fall 30-36 percentage points from their in-distribution Zhang numbers to the leakage-clean numbers, consistent with substantial pretraining overlap inflating prior numbers (a CAMEO-vs-Zhang difficulty shift contributes too, so we read it as an upper bound on the leakage component). We additionally introduce representation-CJ, a hidden-state generalization of the Jacobian for architectures without a masked-LM head; show that the optimal K tracks how diffusely a model spreads its contact heads; and find that both methods lose the contact signal on both causal LMs we test (ProGen2), suggesting attention-encoded pair structure may depend on bidirectional pretraining.
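To make the cost comparison concrete, the sketch below computes a perturbation-based contact map in the spirit of the Categorical Jacobian and counts model evaluations. It is a simplified illustration, not the published method: `logits_fn` is a hypothetical stand-in for a masked-LM forward pass, and reducing the four-way tensor to an L x L map with a Frobenius norm over the amino-acid axes is an assumption (the original procedure may add centering or other corrections).

```python
import numpy as np

def categorical_jacobian_scores(seq, logits_fn, n_tokens=20):
    """seq: integer-encoded sequence of length L.
    logits_fn: hypothetical callable mapping a sequence to an (L, n_tokens) array.
    Returns an (L, L) score map and the number of sequences evaluated."""
    seq = np.asarray(seq)
    L = len(seq)
    base = logits_fn(seq)
    n_passes = 1
    jac = np.zeros((L, n_tokens, L, n_tokens))
    for i in range(L):
        for a in range(n_tokens):
            if a == seq[i]:
                continue  # wild-type residue: the difference is zero by definition
            mutated = seq.copy()
            mutated[i] = a
            jac[i, a] = logits_fn(mutated) - base
            n_passes += 1
    # Collapse both amino-acid axes (assumed reduction), then symmetrize.
    score = np.sqrt((jac ** 2).sum(axis=(1, 3)))
    score = 0.5 * (score + score.T)
    np.fill_diagonal(score, 0.0)
    return score, n_passes
```

For a sequence of length L this evaluates 19L mutants plus the unperturbed sequence, which is the "about 19L forward passes" the abstract contrasts with a single pass for the head readout. Real implementations batch the mutants, which reduces wall-clock time but not the number of sequences evaluated.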
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that protein contact signals are concentrated in a small subset of attention heads within bidirectional protein language models. Selecting the top-K contact-relevant heads on as few as 10 labeled proteins and averaging their maps yields a single-forward-pass contact predictor that outperforms the Categorical Jacobian (CJ) of Zhang et al. (2024) on a leakage-clean CAMEO split (e.g., +9 pp on ESM-2-650M, N=29, p<0.001) while matching or exceeding CJ in-distribution for most models. Ablations attribute the gain to labeled head selection rather than the averaging operation itself; the method also generalizes via a new representation-CJ for models lacking a masked-LM head and fails on the causal LMs tested.
Significance. If the leakage-clean results hold, the work supplies a computationally cheap, attention-based alternative to perturbation-based contact extraction and demonstrates that bidirectional pretraining already encodes pair structure in a sparse set of heads. The empirical ablations, cross-architecture reproduction, and introduction of representation-CJ are concrete strengths. The CAMEO evaluation, if rigorously verified, would strengthen claims that the observed margins reflect intrinsic attention signals rather than pretraining memorization.
major comments (2)
- [Methods / Results (CAMEO split description)] The central +9 pp margin on leakage-clean data (abstract and main results) is interpreted as evidence that the head-averaging signal generalizes beyond memorized sequences. However, the manuscript provides no explicit sequence-identity thresholds, BLAST or MMseqs2 homology search results, or date cutoffs against the models' pretraining corpora (UniRef50/90) for either the 10-protein head-selection set or the N=29 evaluation proteins. This verification is load-bearing for the leakage-clean claim.
- [Results (head-selection procedure)] The optimal K is reported to track how diffusely contact heads are distributed, yet the selection itself uses external labeled contacts. The manuscript should clarify whether the reported margins remain stable under a fully unsupervised head-ranking criterion (e.g., attention entropy or mutual information with sequence conservation) or whether the labeled budget is indispensable.
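The attention-entropy criterion mentioned in the second comment could look roughly like the sketch below, which ranks heads by the mean Shannon entropy of their attention rows. This is only an illustration of the referee's suggestion; the paper does not evaluate it, and the choice to treat low entropy (sharply focused attention) as the better rank is an assumption, as are the function names and the (heads, L, L) layout.

```python
import numpy as np

def mean_attention_entropy(attn, eps=1e-12):
    """attn: (n_heads, L, L) non-negative attention weights.
    Returns the mean row entropy per head, in nats."""
    p = attn / attn.sum(axis=-1, keepdims=True)  # renormalize rows defensively
    row_entropy = -(p * np.log(p + eps)).sum(axis=-1)
    return row_entropy.mean(axis=-1)

def rank_heads_by_entropy(attn):
    """Head indices sorted from lowest to highest mean entropy (no labels used)."""
    return np.argsort(mean_attention_entropy(attn))
```

Comparing the top-K heads from this ranking against the label-selected heads would show directly whether the labeled budget is indispensable.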
minor comments (2)
- [Abstract] The parenthetical remark on exceptions ('the smallest 8M model and a statistical tie on ESM Cambrian') should name the exact 8M model for clarity.
- [Figures / Tables] Figure captions and tables should state whether each reported p-value is corrected for the number of models or splits tested.
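For the p-value comment, one standard option is the Holm step-down adjustment, sketched below. Whether the authors applied any correction, and which one, is not stated in the source; this is offered only as an example of what "corrected for the number of models or splits" could mean in practice.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the rank-th smallest p-value by the number of hypotheses
        # still in play, then enforce monotonicity across ranks.
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

Holm controls the family-wise error rate like Bonferroni but is never more conservative, which matters when a table reports one test per architecture.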
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the strength of our leakage-clean claims and the role of supervision. We address both major points below and will revise the manuscript accordingly where changes are needed.
- Referee: The central +9 pp margin on leakage-clean data (abstract and main results) is interpreted as evidence that the head-averaging signal generalizes beyond memorized sequences. However, the manuscript provides no explicit sequence-identity thresholds, BLAST or MMseqs2 homology search results, or date cutoffs against the models' pretraining corpora (UniRef50/90) for either the 10-protein head-selection set or the N=29 evaluation proteins. This verification is load-bearing for the leakage-clean claim.
  Authors: We agree this verification is important for rigorously supporting the leakage-clean interpretation. The CAMEO split was selected because CAMEO targets are recent structures with low similarity to older training data, but the manuscript does not report explicit MMseqs2 or BLAST searches against UniRef50/90 with sequence-identity thresholds or date cutoffs. We will add this analysis in the revised Methods and Results sections, including the specific thresholds used (e.g., <30% identity over aligned regions) and confirmation that neither the 10-protein selection set nor the N=29 evaluation proteins overlap with pretraining data under those criteria. This will be presented as a new supplementary table.
  revision: yes
- Referee: The optimal K is reported to track how diffusely contact heads are distributed, yet the selection itself uses external labeled contacts. The manuscript should clarify whether the reported margins remain stable under a fully unsupervised head-ranking criterion (e.g., attention entropy or mutual information with sequence conservation) or whether the labeled budget is indispensable.
  Authors: Our ablations already show that the performance gain is attributable to the labeled selection step rather than averaging per se. We did not evaluate fully unsupervised ranking criteria such as attention entropy or conservation-based mutual information, as the core contribution is demonstrating that a small labeled budget (10 proteins) suffices to extract a strong single-pass predictor that outperforms the multi-pass CJ. We will revise the Discussion to explicitly state that the labeled budget appears indispensable based on the existing ablations and that testing unsupervised alternatives is an interesting direction for future work but lies outside the scope of the current study.
  revision: partial
Circularity Check
No significant circularity; empirical selection on external labels
The paper presents an empirical procedure: rank attention heads by contact recovery on a small external set of 10 labeled proteins, then average the top-K heads in a single forward pass. This selection step uses independent labeled data and produces a parameter-free mean readout; neither the ranking nor the averaging reduces by the paper's own equations to a fitted quantity or to a self-citation chain. No mathematical derivation is claimed that would be self-definitional, and the comparison to Categorical Jacobian is a direct empirical benchmark rather than a prediction forced by construction. The CAMEO leakage-clean split is an external data-partition claim, not an internal definitional loop.
Axiom & Free-Parameter Ledger
free parameters (2)
- K (number of heads)
- Head selection labels
axioms (2)
- domain assumption: Contact information is concentrated in a small subset of attention heads in bidirectional protein LMs
- domain assumption: The CAMEO split provides a leakage-free evaluation relative to pretraining data
Reference graph
Works this paper leans on
-
[1]
Anonymous. LiveProteinBench: A contamination-free benchmark for assessing models’ specialized capabilities in protein science. arXiv preprint arXiv:2512.22257,
-
[2]
Tristan Bepler and Bonnie Berger
doi: 10.1038/s42256-025-01176-7. Tristan Bepler and Bonnie Berger. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (ICLR),
-
[3]
Koo, David Baker, Yun S
Nicholas Bhattacharya, Neil Thomas, Roshan Rao, Justas Dauparas, Peter K. Koo, David Baker, Yun S. Song, and Sergey Ovchinnikov. Single layers of attention suffice to predict protein contacts. In ICLR 2021 Workshop on Energy Based Models,
2021
-
[4]
Preprint: bioRxiv 2020.12.21.423882. Biohub. ESM: A world model of protein biology. https://biohub.ai/esm/protein,
2020
-
[5]
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D
Open-source release bundling ESM Cambrian, ESMFold2, and the ESM Atlas. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? an analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,
2019
-
[6]
Learning meaningful representations of protein sequences. Nature Communications, 13:1914,
Nicki Skafte Detlefsen, Søren Hauberg, and Wouter Boomsma. Learning meaningful representations of protein sequences. Nature Communications, 13:1914,
2022
-
[7]
doi: 10.1038/s41467-022-29443-w. Stanley D. Dunn, Lindi M. Wahl, and Gregory B. Gloor. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics, 24(3):333–340,
-
[8]
doi: 10.1109/TPAMI.2021.3095381. ESM Team. ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning. https://evolutionaryscale.ai/blog/esm-cambrian,
-
[9]
doi: 10.1101/2024.09.23.614603. AMPLIFY model release. Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. Science, 387:850–858,
-
[10]
Transformer protein language models are unsupervised structure learners
Roshan Rao, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, and Alexander Rives. Transformer protein language models are unsupervised structure learners. In International Conference on Learning Representations (ICLR), 2021a. Preprint: bioRxiv 2020.12.15.422761. Roshan M. Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, Pieter Abbeel, Tom Sercu,...
2020
-
[11]
doi: 10.1038/s41592-025-02836-7. Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35:1026–1028,
-
[12]
p <0.01 on every cell
A UniRef50 release choice for the Hermann filter The Hermann [Hermann et al., 2024] filter requires that each test sequence have no high-identity hit against a database representative of the pretraining corpus. ESM-2 was pretrained on the UniRef50 2021_04 release [Lin et al., 2023]; we filter against the current (2025) UniRef50 release rather than the 202...
2024
-
[13]
Top Heads
and a “Top Heads” variant that uses the regression only to select the top k heads, then averages them unweighted; the latter outperforms the former. Neither variant evaluates an unsupervised all-heads average, and the labeled head-selection in the regression is fit to per-pair contact labels across a much larger PDB-chain training set than our 50-protein ...
2000
-
[14]
win rate
Figure A1: K-sweep across architectures. Top-L/2 long-range precision as a function of K on Zhang eval-200. Four of nine variants peak at K ∈ {2, 3, 7}; a one-size-fits-all default (K=10) is not optimal across the full architecture set. Table A7 quantifies the head-cluster diffuseness used in Section 4.3. For each architecture we report: the global top-1 he...
2021
-
[15]
repr-CJ scores on ESM-2-650M Zhang eval-200 (N=200, Pearson r=0.96)
Figure A3: Representation-CJ validation. Per-protein logit-CJ vs. repr-CJ scores on ESM-2-650M Zhang eval-200 (N=200, Pearson r=0.96). Points cluster near y=x; repr-CJ is a faithful generalization of logit-CJ on architectures where both apply. Table A8: AMPLIFY-350M repr-CJ P@L/2 long on Zhang eval-200 as a function of layer index ℓ. N=200 proteins per row....
2024