Atom-level Protein Representation Learning Improves Protein Structure Prediction

Hyeongwoo Kim; Hyosoon Jang; Hyunjin Seo; Mingyeong Shin; Seonghwan Seo; Sungsoo Ahn; Taewon Kim; Wonho Zhung; Wooyoun Kim

arxiv: 2605.22133 · v3 · pith:4LFMY7YZnew · submitted 2026-05-21 · 🧬 q-bio.BM · cs.AI

Taewon Kim, Hyosoon Jang, Hyunjin Seo, Seonghwan Seo, Hyeongwoo Kim, Wonho Zhung, Mingyeong Shin, Wooyoun Kim, Sungsoo Ahn

Pith reviewed 2026-05-25 02:52 UTC · model grok-4.3

classification 🧬 q-bio.BM cs.AI

keywords protein representation learning · structure prediction · pretraining · multi-view · VQ-VAE · homodimer · interaction prediction · RepSP benchmark

The pith

Pretraining to recover tokens from corrupted amino-acid, backbone, and full-atom views produces representations that improve structure-prediction tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TriProRep, which jointly encodes three aligned residue-level views of a protein—amino-acid identity, backbone geometry, and local full-atom geometry—using VQ-VAE tokenizers. It pretrains the model to reconstruct the original tokens after generator-based corruption of these views, forcing it to learn distinctions between plausible but incorrect combinations. The authors create the RepSP benchmark to test how well such representations support structure-oriented uses: homodimer co-folding from separate chain representations, residue-level interaction property prediction, and representation-aligned monomer folding. Across these tasks TriProRep outperforms both sequence-only baselines and earlier structure-aware representation methods while remaining competitive on standard benchmarks.

Core claim

TriProRep jointly models three aligned residue-level views—amino-acid identity, backbone geometry, and local full-atom geometry—discretely encoded via VQ-VAE tokenizers. By pretraining to recover original tokens from generator-corrupted views, it learns to distinguish plausible but incorrect cross-view augmentations from the original protein. When evaluated on RepSP, which includes homodimer co-folding from apo-chain representations, residue-level prediction of interaction properties, and representation-aligned monomer structure prediction, TriProRep improves over sequence-only and prior structure-aware models.

What carries the argument

Joint three-view token recovery pretraining on VQ-VAE-encoded amino-acid identity, backbone geometry, and local full-atom geometry.
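
As a rough illustration of what three aligned residue-level token views look like, the sketch below quantizes per-residue embeddings to their nearest codebook entry, which is the lookup step of a VQ-VAE tokenizer. Everything concrete here is an assumption made for illustration: the toy sizes, the random vectors standing in for learned encoder outputs, the helper names `nearest_code` and `tokenize_view`, and the choice to index amino acids directly from a 20-letter alphabet. It is not the paper's tokenizer.

```python
import math
import random

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (illustrative vocabulary)


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec in squared L2 distance."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize_view(embeddings, codebook):
    """Map each per-residue embedding to one discrete token."""
    return [nearest_code(e, codebook) for e in embeddings]


def random_vectors(n, dim, rng):
    """Random stand-ins for learned codebooks and encoder outputs (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM, VOCAB = 4, 8  # toy sizes, not the paper's
sequence = "MKTAY"

backbone_codebook = random_vectors(VOCAB, DIM, rng)
fullatom_codebook = random_vectors(VOCAB, DIM, rng)

# Three token sequences of equal length, one token per residue in each view.
views = {
    "amino_acid": [AA_ALPHABET.index(c) for c in sequence],
    "backbone": tokenize_view(random_vectors(len(sequence), DIM, rng), backbone_codebook),
    "full_atom": tokenize_view(random_vectors(len(sequence), DIM, rng), fullatom_codebook),
}
```

Position i in each list refers to the same residue, which is what allows a pretraining objective to corrupt one view and check it against the others.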

If this is right

Representations support direct homodimer co-folding from individual apo-chain inputs.
They improve residue-level prediction of homodimer interaction properties.
They enhance monomer structure prediction when the model is aligned to the learned representations.
Performance on conventional protein benchmarks remains competitive with prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cross-view corruption objective could be tested on other biomolecules that have both sequence and coordinate data.
RepSP-style benchmarks may expose limitations in existing representations when the downstream goal is interaction rather than single-chain folding.
If the three-view alignment proves robust, similar tokenization could reduce the need for task-specific fine-tuning in related prediction problems.

Load-bearing premise

Recovering original tokens from generator-corrupted cross-view augmentations produces representations that transfer to downstream structure-prediction tasks without additional fine-tuning or task-specific adaptation.

What would settle it

Apply the frozen TriProRep representations to the three RepSP tasks and measure whether co-folding accuracy, interaction prediction AUC, or monomer structure quality show no improvement over sequence-only baselines.
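
For the interaction-prediction part of this test, the comparison could be sketched as below: compute a rank-based ROC AUC for a candidate and a baseline on the same labels and look at the difference. The function names and the four-residue toy inputs are invented for illustration and are not results from the paper; a real check would also need repeated runs and a significance test.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def improvement(labels, candidate_scores, baseline_scores):
    """AUC of the candidate minus AUC of the baseline on the same residues."""
    return roc_auc(labels, candidate_scores) - roc_auc(labels, baseline_scores)


# Toy inputs only: 1 marks an interface residue, 0 a non-interface residue.
labels = [1, 0, 1, 0]
candidate = [0.9, 0.2, 0.8, 0.1]  # hypothetical probe on structure-aware features
baseline = [0.9, 0.8, 0.4, 0.1]   # hypothetical probe on sequence-only features
delta = improvement(labels, candidate, baseline)
```

A `delta` that stays at or below zero across the RepSP tasks would count against the load-bearing premise.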

Figures

Figures reproduced from arXiv: 2605.22133 by Hyeongwoo Kim, Hyosoon Jang, Hyunjin Seo, Mingyeong Shin, Seonghwan Seo, Sungsoo Ahn, Taewon Kim, Wonho Zhung, Wooyoun Kim.

**Figure 1.** TRIPROREP. (a) Three-view tokenization. A protein is independently tokenized into amino-acid, backbone, and full-atom token sequences at the residue level. (b) ELECTRA-style discriminative pretraining. A small generator corrupts each of the three sequences, and a large discriminator predicts the original token at every position. The richer space of cross-token corruptions provides a stronger training signa…
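
The corrupt-then-recover loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions: a uniform sampler stands in for the small learned generator, an untrained placeholder stands in for the large discriminator, the 15% replacement rate and the amino-acid vocabulary of 20 are invented, and 512 is borrowed from the codebook size quoted alongside Figure 5. None of the function names come from the paper.

```python
import math
import random


def corrupt(tokens, vocab_size, replace_rate, rng):
    """Replace a random subset of positions with sampled tokens.

    A uniform sampler stands in for the small learned generator (assumption).
    Returns the corrupted sequence and the positions that were resampled.
    """
    corrupted, replaced = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < replace_rate:
            corrupted[i] = rng.randrange(vocab_size)
            replaced.append(i)
    return corrupted, replaced


def recovery_loss(logits, original):
    """Mean cross-entropy of predicting the original token at every position."""
    total = 0.0
    for row, target in zip(logits, original):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total / len(original)


def uniform_discriminator(corrupted, vocab_size):
    """Untrained placeholder: equal logits for every token at every position.

    A real discriminator would condition on all three corrupted views jointly.
    """
    return [[0.0] * vocab_size for _ in corrupted]


rng = random.Random(0)
vocab_sizes = {"amino_acid": 20, "backbone": 512, "full_atom": 512}
originals = {name: [rng.randrange(v) for _ in range(12)] for name, v in vocab_sizes.items()}

# Sum the per-view recovery losses, scoring every position rather than only replaced ones.
total_loss = 0.0
for name, tokens in originals.items():
    corrupted, _ = corrupt(tokens, vocab_sizes[name], replace_rate=0.15, rng=rng)
    logits = uniform_discriminator(corrupted, vocab_sizes[name])
    total_loss += recovery_loss(logits, tokens)
```

With the placeholder discriminator the loss equals the sum of log vocabulary sizes, the value a model that has learned nothing would score; training would aim to push it below that.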

**Figure 2.** REPSP. We define three structure-generative tasks that use protein representations as input: (task 1) homodimer structure prediction, (task 2) per-residue homodimer binding-property prediction via MLP probing, and (task 3) distillation into a monomer structure prediction model.

… identity. From the resulting cluster representatives, we select 400 validation and 1,000 test sequences, and use the remaining rep…

**Figure 3.** Scaling of flexible-docking. Predicted homodimer structures (chain A blue, chain B gold) overlaid on ground truth (gray) across encoder sizes (150M, 650M, 3B) for four test records.

… while the huge TRIPROREP model achieves the strongest performance on nearly all metrics, with ESM3 only marginally higher in LDDT. The gains are most pronounced on interface-level metrics, which depend not only on accurate mono…

**Figure 4.** Acceleration of monomer structure prediction via representation alignment on REPSP. We compare the no-REPA baseline against ESM2, SaProt, S-PLM, MIF-ST, and TRIPROREP as the alignment target. TRIPROREP provides the strongest alignment target for the structure prediction model.
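
One common way to use a frozen encoder as an alignment target is to add a cosine-similarity term between the structure model's per-residue hidden states and the encoder's per-residue representations. The sketch below shows that idea only. The exact loss used in the paper is not given on this page, so the negative-cosine form, the `weight` parameter, and all names are assumptions.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors, with eps guarding zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(hidden, target):
    """Negative mean per-residue cosine similarity (lower means better aligned)."""
    sims = [cosine(h, t) for h, t in zip(hidden, target)]
    return -sum(sims) / len(sims)


def training_objective(task_loss, hidden, target, weight=0.5):
    """Task loss plus a weighted alignment term; weight is a hypothetical knob."""
    return task_loss + weight * alignment_loss(hidden, target)


# Toy per-residue vectors: two residues, two dimensions.
target = [[1.0, 0.0], [0.0, 1.0]]      # frozen encoder representations
aligned = [[2.0, 0.0], [0.0, 3.0]]     # same directions, different scale
orthogonal = [[0.0, 1.0], [1.0, 0.0]]  # unrelated directions
```

Because cosine similarity ignores scale, `aligned` reaches the minimum loss of about -1 even though its norms differ from the target's, while `orthogonal` scores 0.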

**Figure 5.** Tokens vs. sidechain rotamer. Density of codes in the χ1 simplex. (a) Backbone tokens. (b) Full-atom tokens.

Hyperparameters. The tokenizer uses single width 256, pair width 128, and N = 6 Pairformer-style layers with 8 attention heads. The output embedding dimension is 256. The codebook contains V = 512 entries, uses EMA updates with decay 0.99, and uses entropy regularization with weight 0.1. Backbone a…
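
The hyperparameters quoted above can be collected into a small config, and the EMA codebook update can be sketched in its standard VQ-VAE form: running counts and running sums of assigned encoder outputs are decayed, and each code becomes their ratio. The field names and the `ema_update` helper are hypothetical, and whether the paper uses exactly this variant is not stated on this page; only the numeric values come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerConfig:
    """Values quoted with Figure 5; the field names are illustrative."""
    single_width: int = 256
    pair_width: int = 128
    num_layers: int = 6
    num_heads: int = 8
    embed_dim: int = 256
    codebook_size: int = 512
    ema_decay: float = 0.99
    entropy_weight: float = 0.1


def ema_update(codebook, counts, sums, encodings, assignments, decay):
    """Standard EMA codebook update, applied in place.

    counts[k] tracks how often code k is used, sums[k] tracks the running sum
    of encoder outputs assigned to it, and codebook[k] becomes sums[k] / counts[k].
    """
    dim = len(codebook[0])
    batch_counts = [0] * len(codebook)
    batch_sums = [[0.0] * dim for _ in codebook]
    for z, k in zip(encodings, assignments):
        batch_counts[k] += 1
        for d in range(dim):
            batch_sums[k][d] += z[d]
    for k in range(len(codebook)):
        counts[k] = decay * counts[k] + (1 - decay) * batch_counts[k]
        for d in range(dim):
            sums[k][d] = decay * sums[k][d] + (1 - decay) * batch_sums[k][d]
        if counts[k] > 0:
            codebook[k] = [s / counts[k] for s in sums[k]]


config = TokenizerConfig()

# Toy update with decay 0.5 so the arithmetic is easy to follow;
# config.ema_decay holds the reported 0.99.
codebook = [[0.0, 0.0], [1.0, 1.0]]
counts = [1.0, 1.0]
sums = [[0.0, 0.0], [1.0, 1.0]]
ema_update(codebook, counts, sums, encodings=[[2.0, 2.0]], assignments=[1], decay=0.5)
```

In the toy call, code 1 moves halfway from (1, 1) toward the assigned encoding (2, 2), while the unused code 0 keeps its position and only its usage count decays.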

Original abstract

Recent advances in generative modeling show that pretrained representations can improve generation as conditioning features or alignment targets. Motivated by this, we study protein representations for predicting structures beyond conventional function annotation. We propose TriProRep, a structure-aware pretraining method that jointly models three aligned residue-level views: amino-acid identity, backbone geometry, and local full-atom geometry, discretely encoded via VQ-VAE tokenizers. By pretraining to recover original tokens from generator-corrupted views, TriProRep learns to distinguish plausible but incorrect cross-view augmentations from the original protein. We further introduce RepSP, a benchmark for evaluating protein representations in structure-predictive settings. RepSP tests three uses of representations: homodimer co-folding from apo-chain representations, residue-level prediction of homodimer-derived interaction properties, and representation-aligned monomer structure prediction. Across these tasks, TriProRep improves over sequence-only and prior structure-aware representation models, while maintaining competitive performance on conventional benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TriProRep's three-view VQ-VAE pretraining and RepSP benchmark are new, but the abstract supplies no numbers so the claimed gains on structure tasks remain unverified.

TriProRep's three-view VQ-VAE pretraining and RepSP benchmark are the main new pieces. The method encodes amino-acid, backbone, and full-atom views separately with VQ-VAE, then pretrains by recovering tokens from corrupted cross-view versions. This is meant to produce representations that work for structure prediction tasks without extra fine-tuning. The paper does a decent job laying out why cross-view consistency might help with things like homodimer co-folding and interaction prediction. Introducing RepSP as a testbed for representation use in structure settings is also useful, as it moves beyond standard function annotation benchmarks.

The soft spot is the lack of any numbers in the abstract. It says TriProRep improves over sequence-only and prior models on the three RepSP tasks, but without quantitative results, baselines, or ablations it's hard to see how big the gains are or whether they are robust. The central assumption that token recovery on corrupted views transfers to global structure tasks is plausible but untested in the provided summary. If the full paper has those details and they hold up, the work is more interesting; right now the evidence is thin.

This paper is for researchers in computational biology who are building or using protein representations for structure-related problems. A reader working on pretraining methods or benchmarks would get value from the new setup and the RepSP definition, even if they end up modifying the approach. It deserves a serious referee because the ideas are grounded in the literature on generative pretraining and the benchmark could be adopted more widely if the experiments are solid. I'd recommend sending it to review rather than desk rejecting, provided the full manuscript includes the missing experimental details.

Referee Report

2 major / 1 minor

Summary. The paper proposes TriProRep, a structure-aware pretraining method that jointly models three aligned residue-level views (amino-acid identity, backbone geometry, and local full-atom geometry) via VQ-VAE tokenizers. Pretraining recovers original tokens from generator-corrupted cross-view augmentations to learn distinctions between plausible but incorrect augmentations and the original protein. The authors introduce the RepSP benchmark, which evaluates representations on homodimer co-folding from apo-chain representations, residue-level prediction of homodimer-derived interaction properties, and representation-aligned monomer structure prediction. They claim TriProRep improves over sequence-only and prior structure-aware models on RepSP tasks while remaining competitive on conventional benchmarks.

Significance. If the claimed improvements hold under rigorous controls, the work would demonstrate that atom-level cross-view token recovery pretraining can yield transferable structure-aware representations without task-specific fine-tuning, advancing representation learning for proteins beyond sequence-only or coarse structure models. The RepSP benchmark itself would be a useful standardized testbed for structure-predictive uses of representations.

major comments (2)

[Abstract] The claim of improvements across RepSP tasks is stated without any quantitative results, baselines, error bars, ablation studies, or statistical details, preventing assessment of whether the gains are load-bearing or attributable to the proposed pretraining rather than dataset effects.
[Abstract, pretraining description] The token-recovery objective on generator-corrupted cross-view augmentations is presented as producing representations that improve homodimer co-folding, interaction prediction, and monomer folding without fine-tuning, yet the objective supplies only local cross-view consistency and no direct supervision on global fold geometry or interaction interfaces; this leaves open whether reported gains reflect learned structure awareness or artifacts such as tokenizer leakage or data overlap.
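
The data-overlap concern in the second comment can be made concrete with a simple screen: flag any test sequence that shares too many k-mers with some training sequence. The sketch below is a crude stand-in for the alignment-based identity clustering normally used for this purpose. The function names, k = 3, the 0.5 threshold, and the toy sequences are all assumptions for illustration.

```python
def kmers(sequence, k=3):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_overlap(train, test, k=3, threshold=0.5):
    """Return (sequence, best_similarity) for test sequences too close to training data."""
    train_sets = [kmers(s, k) for s in train]
    flagged = []
    for seq in test:
        test_set = kmers(seq, k)
        best = max((jaccard(test_set, t) for t in train_sets), default=0.0)
        if best >= threshold:
            flagged.append((seq, best))
    return flagged


# Toy sequences: the first test entry duplicates a training entry.
train = ["MKTAYIAKQR", "GSHMLEDPVA"]
test = ["MKTAYIAKQR", "WWWWCCCCFF"]
flagged = flag_overlap(train, test)
```

A screen like this would need to cover both the pretraining corpus and the tokenizer's training data, since leakage through either could inflate downstream scores.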

minor comments (1)

[Abstract] 'RepSP' is introduced as a new benchmark without spelling out the acronym or providing even a high-level description of its three tasks beyond the listed uses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation.

Referee: [Abstract] The claim of improvements across RepSP tasks is stated without any quantitative results, baselines, error bars, ablation studies, or statistical details, preventing assessment of whether the gains are load-bearing or attributable to the proposed pretraining rather than dataset effects.

Authors: We agree that the abstract would benefit from quantitative support. In the revised manuscript, we will incorporate key performance metrics from the RepSP benchmark, including specific improvements over sequence-only and prior structure-aware baselines, along with error bars from repeated runs and any available statistical details. This will allow clearer evaluation of the gains.

Revision: yes

Referee: [Abstract, pretraining description] The token-recovery objective on generator-corrupted cross-view augmentations is presented as producing representations that improve homodimer co-folding, interaction prediction, and monomer folding without fine-tuning, yet the objective supplies only local cross-view consistency and no direct supervision on global fold geometry or interaction interfaces; this leaves open whether reported gains reflect learned structure awareness or artifacts such as tokenizer leakage or data overlap.

Authors: While the objective operates at the local residue level, the joint three-view modeling is intended to capture structural distinctions that transfer to global tasks, as demonstrated by the RepSP results on co-folding and interaction properties. The full manuscript provides comparisons showing advantages over prior models. To address concerns about artifacts, we will expand the discussion section with additional analysis on data overlap and tokenizer behavior. We view the gains as reflecting the learned representations rather than artifacts.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity: pretraining objective independent of downstream RepSP tasks

The paper defines TriProRep via a VQ-VAE token-recovery objective on generator-corrupted cross-view augmentations (amino-acid, backbone, full-atom). This objective is stated independently of the RepSP benchmark tasks (homodimer co-folding, residue-level interaction prediction, representation-aligned monomer folding). No equations, fitted parameters, or self-citations are shown that reduce the claimed gains on RepSP to quantities fitted on the evaluation data or to self-referential definitions. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; limited visibility into exact hyperparameters or modeling choices.

axioms (1)

Domain assumption: VQ-VAE tokenizers can faithfully discretize amino-acid identity, backbone geometry, and local full-atom geometry without significant information loss for downstream structure tasks.
Central to the three-view tokenization step described in the abstract.
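
Two quick diagnostics bear on this assumption: how much of the codebook is actually used, and how far each embedding sits from the code it was assigned. The sketch below computes both on toy data. The function names and numbers are illustrative assumptions, and quantization error in embedding space is only a proxy, since real information loss would have to be measured on decoded coordinates.

```python
import math
from collections import Counter


def codebook_perplexity(tokens):
    """exp(entropy) of the empirical token distribution.

    Equals the number of codes in use when usage is uniform, and 1.0 when
    every position collapses onto a single code.
    """
    total = len(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in Counter(tokens).values())
    return math.exp(entropy)


def quantization_mse(embeddings, codebook, tokens):
    """Mean squared error between each embedding and its assigned code vector."""
    errors = [
        (x - q) ** 2
        for emb, tok in zip(embeddings, tokens)
        for x, q in zip(emb, codebook[tok])
    ]
    return sum(errors) / len(errors)


# Toy data: four residues, each assigned to a different code.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
embeddings = [[0.1, -0.1], [0.9, 1.1], [2.0, 2.0], [3.2, 2.8]]
tokens = [0, 1, 2, 3]

perplexity = codebook_perplexity(tokens)
mse = quantization_mse(embeddings, codebook, tokens)
```

Low perplexity relative to the codebook size, or a large quantization error, would each be a warning sign that the discretization is discarding structure the downstream tasks need.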

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "pretraining to recover original tokens from generator-corrupted views... three aligned residue-level views"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "VQ-VAE tokenizers... corrective pretraining objective"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.