Protein contacts are already in the attention: a single-forward-pass alternative to the Categorical Jacobian

Rome Thorstenson

arxiv: 2606.21876 · v1 · pith:VKU4J5VWnew · submitted 2026-06-20 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 12:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords protein language models · attention heads · contact prediction · categorical jacobian · single forward pass · pretraining leakage · bidirectional models · esm models

The pith

Protein contacts can be recovered from a small number of attention heads in a single forward pass, outperforming the multi-pass Categorical Jacobian on leakage-clean data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The Categorical Jacobian (CJ) extracts protein contacts from a language model by substituting every alternative amino acid at every residue, which costs about 19L forward passes for a protein of length L. The paper shows that the signal CJ reconstructs is already concentrated in a small subset of attention heads. Ranking the heads on as few as ten labeled proteins and averaging the top K recovers contacts in one forward pass. On leakage-clean data this readout beats CJ for every bidirectional model where CJ is defined; in-distribution it matches or beats CJ, except on the smallest 8M model and a statistical tie on ESM Cambrian. Ablations attribute the improvement to the labeled selection step rather than to the averaging.

Core claim

The signal the Categorical Jacobian (CJ) reconstructs is already concentrated in a small subset of attention heads: averaging the top-K contact-relevant heads, selected on as few as 10 labeled proteins, recovers contacts in one forward pass and beats CJ on leakage-clean data for every bidirectional model where CJ is defined, and matches or beats it in-distribution (the exceptions being the smallest 8M model and a statistical tie on ESM Cambrian).

What carries the argument

Top-K contact-relevant attention heads, selected on labeled proteins, whose averaged attention maps yield contact predictions in a single forward pass.
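The readout described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the flattened `(H, L, L)` attention layout (all layers and heads stacked along the first axis), the symmetrization step, and the long-range cutoff of 24 residues are all choices made here for the example.

```python
import numpy as np

def symmetrize(attn):
    """Average a map with its transpose so score(i, j) == score(j, i)."""
    return 0.5 * (attn + np.swapaxes(attn, -1, -2))

def long_range_precision(scores, contacts, min_sep=24):
    """Top-L/2 precision over pairs with j - i >= min_sep (cutoff is an assumption)."""
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)
    top = np.argsort(-scores[i, j])[: max(1, L // 2)]
    return float(contacts[i[top], j[top]].mean())

def select_heads(attn_maps, contact_maps, k):
    """Rank heads by mean precision on the labeled proteins and keep the top k.

    attn_maps:    list of (H, L, L) arrays, one per labeled protein
    contact_maps: list of (L, L) binary arrays with the true contacts
    """
    per_head = np.mean(
        [
            [long_range_precision(symmetrize(a[h]), c) for h in range(a.shape[0])]
            for a, c in zip(attn_maps, contact_maps)
        ],
        axis=0,
    )
    return np.argsort(-per_head)[:k]

def predict_contacts(attn, heads):
    """Unweighted mean of the selected heads' maps from one forward pass."""
    return symmetrize(attn[heads].mean(axis=0))
```

Selection runs once on the small labeled set; afterwards `predict_contacts` needs only the attention tensor from a single forward pass of a new protein.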

If this is right

The head readout beats CJ by 9 pp on ESM-2-650M in the leakage-clean CAMEO split.
The gain localizes to labeled head selection, not averaging, since the unweighted mean ties a supervised logistic regression at matched label budget.
Representation-CJ extends the Jacobian to architectures without a masked-LM head.
Optimal K tracks how diffusely a model spreads its contact heads.
Both methods lose the contact signal on causal LMs such as ProGen2.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention mechanisms in bidirectional models may inherently capture pairwise residue interactions during pretraining.
Head selection could be applied to extract other structural signals from the same models without perturbation-based methods.
The performance drop on leakage-clean data suggests that contact recovery numbers in prior work partly reflect pretraining overlap.

Load-bearing premise

The CAMEO split ensures that neither the head selection on 10 proteins nor the evaluation set touches sequences the models have plausibly memorized during pretraining.

What would settle it

If the averaged top-K heads fail to outperform the Categorical Jacobian on a new leakage-clean CAMEO-style split for bidirectional models, the claimed accuracy advantage of the single-forward-pass readout would be falsified.

Figures

Figures reproduced from arXiv: 2606.21876 by Rome Thorstenson.

**Figure 1.** Top-L/2 long-range precision on Zhang eval-200 across nine protein-LM variants (ESM-2 at five scales, ESM-1b, AMPLIFY-350M, ProtT5-XL, ProGen2-xlarge). Naive-mean fusion of the top-K attention heads (blue) beats the per-protein top-1 head (grey) and the best-available Categorical Jacobian (red) on every cell at and above the 35M scale; the concentrated-cluster ESM-2-8M (best read at K=3, not the K=10 plott…

Original abstract

The Categorical Jacobian (CJ) of Zhang et al. (2024) reads protein contacts from a language model by perturbing every residue with every alternative amino acid, about 19L forward passes. We show the signal it reconstructs is already concentrated in a small subset of attention heads: averaging the top-K contact-relevant heads, selected on as few as 10 labeled proteins, recovers contacts in one forward pass and beats CJ on leakage-clean data for every bidirectional model where CJ is defined, and matches or beats it in-distribution (the exceptions being the smallest 8M model and a statistical tie on ESM Cambrian). Ablations localize the gain to labeled head selection, not averaging: at a matched label budget the unweighted mean ties a supervised L1 logistic regression on the same heads, so the parameter-free mean is selection's minimal form, not the source of the advantage. Our primary test is leakage-clean: on a CAMEO split where neither selection nor evaluation touches data the models have plausibly memorized, the head readout beats CJ on ESM-2-650M by +9 pp (N=29, p<0.001), with the within-model margin reproducing across architectures on a wider pretraining-aware set. Both methods fall 30-36 percentage points from their in-distribution Zhang numbers to the leakage-clean numbers, consistent with substantial pretraining overlap inflating prior numbers (a CAMEO-vs-Zhang difficulty shift contributes too, so we read it as an upper bound on the leakage component). We additionally introduce representation-CJ, a hidden-state generalization of the Jacobian for architectures without a masked-LM head; show that the optimal K tracks how diffusely a model spreads its contact heads; and find that both methods lose the contact signal on both causal LMs we test (ProGen2), suggesting attention-encoded pair structure may depend on bidirectional pretraining.
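The 19L figure in the abstract follows from substituting each of the 19 alternative amino acids at each of the L positions. The sketch below makes that count explicit on a generic model. It is a simplified illustration rather than the published procedure: `logits_fn` is a hypothetical callable mapping an integer-encoded sequence to an `(L, 20)` logit array, and post-processing steps such as centering or average-product correction are left out.

```python
import numpy as np

AA = 20  # size of the amino-acid alphabet

def categorical_jacobian_scores(seq, logits_fn):
    """Perturbation-based contact scores.

    seq:       integer array of shape (L,), values in [0, AA)
    logits_fn: hypothetical callable, seq -> (L, AA) logits
    Returns an (L, L) symmetric score matrix and the number of forward passes.
    """
    L = len(seq)
    base = logits_fn(seq)  # one pass on the unperturbed sequence
    passes = 1
    jac = np.zeros((L, AA, L, AA))
    for i in range(L):
        for a in range(AA):
            if a == seq[i]:
                continue  # the wild-type residue needs no extra pass
            mutated = seq.copy()
            mutated[i] = a
            jac[i, a] = logits_fn(mutated) - base
            passes += 1
    # Collapse the two amino-acid axes with a Frobenius norm, then symmetrize.
    scores = np.sqrt((jac ** 2).sum(axis=(1, 3)))
    return 0.5 * (scores + scores.T), passes
```

For a 300-residue protein the loop issues 19 × 300 = 5700 perturbed passes plus one baseline pass, which is the cost a single-pass attention readout avoids.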

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Head averaging after selection on 10 proteins gives a single-pass contact readout that beats CJ on the CAMEO split, but the no-leakage claim rests on an unverified split.

The main point is that contact signal lives in a small number of attention heads, so ranking them on a tiny labeled set and averaging produces a one-forward-pass predictor that beats the Categorical Jacobian on leakage-clean data for the bidirectional models tested.

The paper does a clean job showing the practical alternative. The ablation is the strongest part: at the same label budget the plain mean matches a supervised logistic regression on the selected heads, which pins the gain on the selection step rather than on any special aggregation. They also add a hidden-state version of the Jacobian for models without a masked head and report that both methods lose the signal on the two causal LMs. The drop from in-distribution numbers to the CAMEO numbers is reported consistently across architectures, which usefully flags how much earlier results may have been helped by overlap.

The soft spot is the leakage guarantee. The abstract states that the CAMEO split keeps both the 10-protein selection set and the N=29 evaluation set away from pretraining data, but it gives no sequence-identity thresholds, homology-search results, or date cutoffs against the actual pretraining corpora. Without those checks the +9 pp margin on ESM-2-650M remains plausible but not fully pinned down; some of it could still trace to undetected memorization. The model-specific choice of K is minor and expected for this kind of work.

This is for people who extract contacts from protein LMs or who want cheaper inference in structural workflows. Readers doing attention probing or trying to cut the 19L cost of Jacobian methods will get direct use from the head-localization results and the ablations. The empirical comparisons are concrete enough to merit a serious referee, even if the leakage details need tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that protein contact signals are concentrated in a small subset of attention heads within bidirectional protein language models. Selecting the top-K contact-relevant heads on as few as 10 labeled proteins and averaging their maps yields a single-forward-pass contact predictor that outperforms the Categorical Jacobian (CJ) of Zhang et al. (2024) on a leakage-clean CAMEO split (e.g., +9 pp on ESM-2-650M, N=29, p<0.001) while matching or exceeding CJ in-distribution for most models. Ablations attribute the gain to labeled head selection rather than the averaging operation itself; the method also generalizes via a new representation-CJ for models lacking a masked-LM head and fails on the causal LMs tested.

Significance. If the leakage-clean results hold, the work supplies a computationally cheap, attention-based alternative to perturbation-based contact extraction and demonstrates that bidirectional pretraining already encodes pair structure in a sparse set of heads. The empirical ablations, cross-architecture reproduction, and introduction of representation-CJ are concrete strengths. The CAMEO evaluation, if rigorously verified, would strengthen claims that the observed margins reflect intrinsic attention signals rather than pretraining memorization.

major comments (2)

[Methods / Results (CAMEO split)] The central +9 pp margin on leakage-clean data (abstract and main results) is interpreted as evidence that the head-averaging signal generalizes beyond memorized sequences. However, the manuscript provides no explicit sequence-identity thresholds, BLAST or MMseqs2 homology search results, or date cutoffs against the models' pretraining corpora (UniRef50/90) for either the 10-protein head-selection set or the N=29 evaluation proteins. This verification is load-bearing for the leakage-clean claim.
[Results (head-selection procedure)] The optimal K is reported to track how diffusely contact heads are distributed, yet the selection itself uses external labeled contacts. The manuscript should clarify whether the reported margins remain stable under a fully unsupervised head-ranking criterion (e.g., attention entropy or mutual information with sequence conservation) or whether the labeled budget is indispensable.
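The verification asked for in the first major comment amounts to searching each selection and evaluation sequence against the pretraining corpus and discarding any query with a high-identity hit. A rough sketch of the filtering step follows. Everything specific in it is an assumption for illustration: the hits are taken to be BLAST-style tab-separated lines whose first four columns are query, target, identity (as a fraction) and alignment length, the 0.30 identity and 30-residue length cutoffs are placeholders rather than values from the paper, and the sequence identifiers in the usage example are made up.

```python
def flag_leaky_queries(hit_lines, max_identity=0.30, min_aln_len=30):
    """Return the set of query IDs with at least one disqualifying hit.

    hit_lines: iterable of tab-separated strings, assumed column order
               query, target, identity fraction, alignment length, ...
    """
    leaky = set()
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        query = fields[0]
        identity = float(fields[2])
        aln_len = int(fields[3])
        if identity >= max_identity and aln_len >= min_aln_len:
            leaky.add(query)
    return leaky

def clean_split(all_queries, hit_lines, **thresholds):
    """Keep only queries with no disqualifying hit, preserving input order."""
    leaky = flag_leaky_queries(hit_lines, **thresholds)
    return [q for q in all_queries if q not in leaky]

# Usage with made-up identifiers: only the first hit is both long and similar.
example_hits = [
    "query_A\ttarget_1\t0.92\t120",
    "query_B\ttarget_2\t0.25\t80",
]
kept = clean_split(["query_A", "query_B", "query_C"], example_hits)
```

Reporting the thresholds and the number of sequences removed at each one, for both the 10-protein selection set and the evaluation set, would let readers judge how much of the margin survives stricter filtering.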

minor comments (2)

[Abstract] The parenthetical remark on exceptions ('the smallest 8M model and a statistical tie on ESM Cambrian') should name the exact 8M model for clarity.
[Figures / Tables] In figure captions and tables, ensure that all reported p-values indicate whether they are corrected for the number of models or splits tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the strength of our leakage-clean claims and the role of supervision. We address both major points below and will revise the manuscript accordingly where changes are needed.

Referee: The central +9 pp margin on leakage-clean data (abstract and main results) is interpreted as evidence that the head-averaging signal generalizes beyond memorized sequences. However, the manuscript provides no explicit sequence-identity thresholds, BLAST or MMseqs2 homology search results, or date cutoffs against the models' pretraining corpora (UniRef50/90) for either the 10-protein head-selection set or the N=29 evaluation proteins. This verification is load-bearing for the leakage-clean claim.

Authors: We agree this verification is important for rigorously supporting the leakage-clean interpretation. The CAMEO split was selected because CAMEO targets are recent structures with low similarity to older training data, but the manuscript does not report explicit MMseqs2 or BLAST searches against UniRef50/90 with sequence-identity thresholds or date cutoffs. We will add this analysis in the revised Methods and Results sections, including the specific thresholds used (e.g., <30% identity over aligned regions) and confirmation that neither the 10-protein selection set nor the N=29 evaluation proteins overlap with pretraining data under those criteria. This will be presented as a new supplementary table.
revision: yes

Referee: The optimal K is reported to track how diffusely contact heads are distributed, yet the selection itself uses external labeled contacts. The manuscript should clarify whether the reported margins remain stable under a fully unsupervised head-ranking criterion (e.g., attention entropy or mutual information with sequence conservation) or whether the labeled budget is indispensable.

Authors: Our ablations already show that the performance gain is attributable to the labeled selection step rather than averaging per se. We did not evaluate fully unsupervised ranking criteria such as attention entropy or conservation-based mutual information, as the core contribution is demonstrating that a small labeled budget (10 proteins) suffices to extract a strong single-pass predictor that outperforms the multi-pass CJ. We will revise the Discussion to explicitly state that the labeled budget appears indispensable based on the existing ablations and that testing unsupervised alternatives is an interesting direction for future work but lies outside the scope of the current study.
revision: partial
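For concreteness, the unsupervised criterion the referee names, attention entropy, could be computed as below. This is only a sketch of what such a label-free ranking might look like. The paper does not evaluate it, and whether low-entropy heads coincide with contact heads is an open question; the function names and the `(H, L, L)` layout with rows summing to one are assumptions of the example.

```python
import numpy as np

def mean_row_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of each head's attention rows.

    attn: (H, L, L) array whose rows along the last axis sum to 1.
    Returns an array of shape (H,).
    """
    p = np.clip(attn, eps, 1.0)  # avoid log(0) on exactly-zero weights
    return -(p * np.log(p)).sum(axis=-1).mean(axis=-1)

def rank_heads_by_entropy(attn_maps, k):
    """Pick the k heads with the lowest average entropy, using no labels."""
    entropy = np.mean([mean_row_entropy(a) for a in attn_maps], axis=0)
    return np.argsort(entropy)[:k]
```

Comparing the heads chosen this way against the label-selected heads at the same K would show directly whether the 10-protein budget can be dropped.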

Circularity Check

0 steps flagged

No significant circularity; empirical selection on external labels

The paper presents an empirical procedure: rank attention heads by contact recovery on a small external set of 10 labeled proteins, then average the top-K heads in a single forward pass. This selection step uses independent labeled data and produces a parameter-free mean readout; neither the ranking nor the averaging reduces by the paper's own equations to a fitted quantity or to a self-citation chain. No mathematical derivation is claimed that would be self-definitional, and the comparison to Categorical Jacobian is a direct empirical benchmark rather than a prediction forced by construction. The CAMEO leakage-clean split is an external data-partition claim, not an internal definitional loop.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical observation that contact signal localizes to attention heads and can be recovered via small-scale supervised selection; no new physical entities are introduced, but the method depends on labeled contact data and the assumption of a clean test split.

free parameters (2)

K (number of heads)
Chosen per model by performance on the 10-protein selection set; value is not fixed a priori.
Head selection labels
Uses 10 labeled proteins to rank heads; constitutes the supervised component of the otherwise parameter-free mean.
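Because K is chosen on the selection set rather than fixed in advance, the choice can be pictured as a small sweep. The sketch below is an assumed form of that sweep, not the author's code: heads are taken to be already ranked, candidate values of K come from a hypothetical grid, and the score is top-L/2 long-range precision with an assumed separation cutoff of 24 residues.

```python
import numpy as np

def top_half_l_precision(scores, contacts, min_sep=24):
    """Top-L/2 precision over pairs with j - i >= min_sep (cutoff is an assumption)."""
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)
    top = np.argsort(-scores[i, j])[: max(1, L // 2)]
    return float(contacts[i[top], j[top]].mean())

def choose_k(attn_maps, contact_maps, ranked_heads, k_grid=(1, 2, 3, 5, 7, 10)):
    """Return (best_k, best_precision) on the labeled selection set.

    attn_maps:    list of (H, L, L) arrays, one per selection protein
    contact_maps: list of (L, L) binary contact arrays
    ranked_heads: head indices sorted from most to least contact-relevant
    Ties are resolved in favor of the smaller K.
    """
    best_k, best_p = None, -1.0
    for k in k_grid:
        if k > len(ranked_heads):
            break
        heads = ranked_heads[:k]
        precisions = []
        for attn, contacts in zip(attn_maps, contact_maps):
            fused = attn[heads].mean(axis=0)
            fused = 0.5 * (fused + fused.T)  # symmetrize the averaged map
            precisions.append(top_half_l_precision(fused, contacts))
        mean_p = float(np.mean(precisions))
        if mean_p > best_p:
            best_k, best_p = k, mean_p
    return best_k, best_p
```

A model whose contact signal sits in a few heads would be expected to peak at a small K in such a sweep, while a model that spreads the signal across many heads would peak later.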

axioms (2)

domain assumption: Contact information is concentrated in a small subset of attention heads in bidirectional protein LMs
Invoked as the basis for why selection plus averaging works.
domain assumption: The CAMEO split provides a leakage-free evaluation relative to pretraining data
Used to interpret the +9 pp margin as evidence against memorization.

Reference graph

Works this paper leans on

15 extracted references · 5 canonical work pages

[1]

LiveProteinBench: A contamination-free benchmark for assessing models’ specialized capabilities in protein science. arXiv preprint arXiv:2512.22257,

Anonymous. LiveProteinBench: A contamination-free benchmark for assessing models’ specialized capabilities in protein science. arXiv preprint arXiv:2512.22257,

arXiv
[2]

Tristan Bepler and Bonnie Berger

doi: 10.1038/s42256-025-01176-7. Tristan Bepler and Bonnie Berger. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (ICLR),

work page doi:10.1038/s42256-025-01176-7
[3]

Koo, David Baker, Yun S

Nicholas Bhattacharya, Neil Thomas, Roshan Rao, Justas Dauparas, Peter K. Koo, David Baker, Yun S. Song, and Sergey Ovchinnikov. Single layers of attention suffice to predict protein contacts. In ICLR 2021 Workshop on Energy Based Models,

2021
[4]

Preprint: bioRxiv 2020.12.21.423882. Biohub. ESM: A world model of protein biology. https://biohub.ai/esm/protein,

2020
[5]

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D

Open-source release bundling ESM Cambrian, ESMFold2, and the ESM Atlas. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? an analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,

2019
[6]

Learning meaningful representations of protein sequences. Nature Communications, 13:1914,

Nicki Skafte Detlefsen, Søren Hauberg, and Wouter Boomsma. Learning meaningful representations of protein sequences. Nature Communications, 13:1914,

1914
[7]

Stanley D

doi: 10.1038/s41467-022-29443-w. Stanley D. Dunn, Lindi M. Wahl, and Gregory B. Gloor. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics, 24(3):333–340,

work page doi:10.1038/s41467-022-29443-w
[8]

doi: 10.1109/TPAMI.2021.3095381. ESM Team. ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning. https://evolutionaryscale.ai/blog/esm-cambrian,

work page doi:10.1109/tpami.2021 2021
[9]

AMPLIFY model release

doi: 10.1101/2024.09.23.614603. AMPLIFY model release. Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. Science, 387:850–858,

work page doi:10.1101/2024.09.23.614603 2024
[10]

Transformer protein language models are unsupervised structure learners

Roshan Rao, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, and Alexander Rives. Transformer protein language models are unsupervised structure learners. In International Conference on Learning Representations (ICLR), 2021a. Preprint: bioRxiv 2020.12.15.422761. Roshan M. Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, Pieter Abbeel, Tom Sercu,...

2020
[11]

2025 , pages =

doi: 10.1038/s41592-025-02836-7. Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35:1026–1028,

work page doi:10.1038/s41592-025-02836-7
[12]

p <0.01 on every cell

A UniRef50 release choice for the Hermann filter The Hermann [Hermann et al., 2024] filter requires that each test sequence have no high-identity hit against a database representative of the pretraining corpus. ESM-2 was pretrained on the UniRef50 2021_04 release [Lin et al., 2023]; we filter against the current (2025) UniRef50 release rather than the 202...

2024
[13]

Top Heads

and a “Top Heads” variant that uses the regression only to select the top k heads, then averages them unweighted; the latter outperforms the former. Neither variant evaluates an unsupervised all-heads average, and the labeled head-selection in the regression is fit to per-pair contact labels across a much larger PDB-chain training set than our 50-protein ...

2000
[14]

win rate

Figure A1: K-sweep across architectures. Top-L/2 long-range precision as a function of K on Zhang eval-200. Four of nine variants peak at K ∈ {2, 3, 7}; a one-size-fits-all default (K=10) is not optimal across the full architecture set. Table A7 quantifies the head-cluster diffuseness used in Section 4.3. For each architecture we report: the global top-1 he...

2021
[15]

repr-CJ scores on ESM-2-650M Zhang eval-200 (N=200, Pearson r=0.96)

Figure A3: Representation-CJ validation. Per-protein logit-CJ vs. repr-CJ scores on ESM-2-650M Zhang eval-200 (N=200, Pearson r=0.96). Points cluster near y=x; repr-CJ is a faithful generalization of logit-CJ on architectures where both apply. Table A8: AMPLIFY-350M repr-CJ P@L/2 long on Zhang eval-200 as a function of layer index ℓ. N=200 proteins per row....

2024