
arxiv: 2605.18552 · v1 · pith:AGC75AZLnew · submitted 2026-05-18 · 💻 cs.LG · q-bio.BM · q-bio.QM

Protein Fold Classification at Scale: Benchmarking and Pretraining

Pith reviewed 2026-05-20 13:11 UTC · model grok-4.3

classification 💻 cs.LG · q-bio.BM · q-bio.QM
keywords protein fold classification · self-supervised learning · masked autoencoders · protein structure · TEDBench · benchmark · SE(3) invariance · MiAE

The pith

Masked Invariant Autoencoders with up to 90 percent masking outperform supervised methods for protein fold classification on a new large benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs TEDBench, a large non-redundant test set for protein fold classification drawn from the Encyclopedia of Domains and Foldseek-clustered AlphaFold predictions. It introduces Masked Invariant Autoencoders, a self-supervised method that masks most input coordinates and trains an SE(3)-invariant encoder plus lightweight decoder to reconstruct the backbone. On this benchmark the approach scales better than prior self-supervised or supervised models and also transfers to experimental structures. A sympathetic reader would care because reliable large-scale fold classification supports understanding of biological function without depending on massive labeled datasets or enormous model sizes.

Core claim

We introduce TEDBench, a large-scale, non-redundant benchmark for protein fold classification constructed from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. We propose Masked Invariant Autoencoders (MiAE), a self-supervised framework for protein structure representation learning that uses an extremely high masking ratio of up to 90 percent with an SE(3)-invariant encoder and a lightweight decoder that reconstructs backbone coordinates from the latent representation and mask tokens. MiAE scales well and outperforms supervised counterparts and state-of-the-art baselines on TEDBench.

What carries the argument

Masked Invariant Autoencoders (MiAE) that apply extreme masking to protein backbone coordinates and use an SE(3)-invariant encoder with a lightweight decoder to reconstruct the full structure from the masked latent representation.
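
The masking step described above can be illustrated with a minimal sketch. It assumes uniform random masking over residue positions and uses plain NumPy arrays as stand-ins for encoder outputs; the function names `random_mask` and `reinsert_mask_tokens` are hypothetical and are not taken from the paper or its code release.

```python
import numpy as np

def random_mask(num_residues, mask_ratio=0.9, rng=None):
    """Return (visible_idx, masked_idx) for one protein backbone."""
    rng = np.random.default_rng() if rng is None else rng
    # Keep at least one residue visible so the encoder has some input.
    num_visible = max(1, int(round(num_residues * (1.0 - mask_ratio))))
    perm = rng.permutation(num_residues)
    visible_idx = np.sort(perm[:num_visible])
    masked_idx = np.sort(perm[num_visible:])
    return visible_idx, masked_idx

def reinsert_mask_tokens(encoded_visible, visible_idx, num_residues, mask_token):
    """Build the full-length decoder input: encoded residues plus mask tokens."""
    # Start with a mask token at every position, then overwrite visible ones.
    full = np.tile(mask_token, (num_residues, 1))
    full[visible_idx] = encoded_visible
    return full

# Illustrative sizes only: 200 residues, 8-dimensional latent vectors.
visible_idx, masked_idx = random_mask(200, mask_ratio=0.9, rng=np.random.default_rng(0))
encoded = np.ones((len(visible_idx), 8))   # stand-in for encoder output
decoder_input = reinsert_mask_tokens(encoded, visible_idx, 200, np.zeros(8))
```

With a 90% ratio the encoder in this sketch sees only 20 of 200 positions, which is where the compute savings of high masking would come from; the decoder still receives a full-length sequence.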

If this is right

  • Protein fold classification benefits from self-supervised pretraining on large unlabeled structure sets instead of relying solely on supervised training.
  • Very high masking ratios remain effective for learning useful structural representations in this domain.
  • The non-redundant benchmark construction removes duplicate-induced inflation that affected earlier protein fold datasets.
  • Gains from MiAE pretraining carry over when models are tested on experimental rather than predicted structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same high-masking recipe might reduce the model size needed for other structure-related prediction tasks in biology.
  • If AlphaFold models contain systematic local errors, future benchmarks could combine experimental and predicted data with explicit error weighting.
  • Similar masking-based pretraining could be tested on other chain-like biomolecules such as RNA backbones.

Load-bearing premise

The construction of TEDBench from TED and Foldseek-clustered AlphaFold structures yields a truly non-redundant and unbiased test of fold classification that generalizes beyond prediction artifacts in AlphaFold models.

What would settle it

Re-evaluate all models on a version of the benchmark rebuilt by clustering only experimental CATH structures with an independent algorithm; if MiAE loses its performance advantage or if substantial fold duplicates appear in the test split, the central scaling claim would be falsified.
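
One simple form of the duplicate check mentioned above can be sketched as follows. It assumes each domain has already been assigned a structural cluster label by an external tool; the dictionary layout, the domain IDs, and the function name `leaked_test_items` are hypothetical and for illustration only.

```python
def leaked_test_items(train_clusters, test_clusters):
    """Return test IDs whose cluster label also occurs in the training split."""
    train_cluster_ids = set(train_clusters.values())
    return sorted(
        domain_id
        for domain_id, cluster_id in test_clusters.items()
        if cluster_id in train_cluster_ids
    )

# Hypothetical domain IDs and cluster labels.
train = {"dom_a": "c1", "dom_b": "c2"}
test = {"dom_c": "c3", "dom_d": "c2"}
print(leaked_test_items(train, test))  # ['dom_d']
```

An empty result on a rebuilt benchmark would only rule out cluster-level overlap; it says nothing about near-duplicates that the clustering itself missed.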

Figures

Figures reproduced from arXiv: 2605.18552 by Andrei Manolache, Dexiong Chen, Karsten Borgwardt, Mathias Niepert.

Figure 1. Overview of TEDBench. TEDBench is a large-scale, non-redundant benchmark for protein fold classification. Given the high diversity of protein structures, CATH (Orengo et al., 1997) provides a hierarchical classification of protein domain structures. TED (Lau et al., 2024) extends this to AFDB. Our TEDBench builds upon TED and contains more than 460K predicted structures and 27K experimental structures …
Figure 2. Reconstructions of experimental structures using an MiAE pre-trained with a masking ratio of 90%. The predictions recover the overall shape and secondary structures well, even at this high masking ratio. Masked residues are shown in gray.
… represented as a 4 × 4 homogeneous matrix: $T_i = \begin{pmatrix} R_i & t_i \\ 0_{1\times 3} & 1 \end{pmatrix} \in \mathrm{SE(3)}$, where $R_i \in \mathrm{SO(3)}$ is a rotation matrix and $t_i \in \mathbb{R}^3$ is a translation vector. The translation $t_i$ cor…
Figure 3. Overview of the MiAE architecture. During pre-training, a high masking ratio (e.g., 90%) is applied to backbone frames. The geometric encoder processes only this small subset of unmasked frames, maintaining SE(3)-invariance relative to the input coordinates. Following the encoder, mask tokens are reintegrated into the latent sequence. A lightweight decoder then operates on the full set of encoded frames an…
Figure 4. Masking ratio. A high masking ratio tends to deliver higher linear probing performance and higher reconstruction error (RMSD). The test performance is plotted.
[Plot: panels "Linear probing performance" and "Fine-tuning performance"; x-axis: dataset size (10^4 to 10^6); y-axis: Macro F1 (%); series: test, external test.]
Figure 6. t-SNE projection of pretrained MiAE protein embeddings (before fine-tuning), colored by CATH topology; several topologies form clear neighborhoods in the learned space.
… scaffolding. Table 5e shows that random span masking works slightly better than random masking while being more complicated. Therefore, we keep the simpler random masking strategy as our default setting. Scaling and sequence incorporation. …
Figure 7. Visualization of attention weights. Protein samples with colored residues based on a heatmap defined by the average attention weights of the last layer of an end-to-end fine-tuned MiAE-B model. (a) and (b): two examples from the mainly alpha class; (b): mainly beta; (c): alpha and beta. The model appears to identify the core structural components.
SaProt. We use the official checkpoints from SaProt: SaProt…
Figure 8. t-SNE projection of protein-level embeddings produced by the MiAE encoder before fine-tuning (mean pooled from final-layer residue representations). Points correspond to proteins and are colored by CATH labels (class, architecture, and topology). The map shows that class and architecture labels tend to occupy distinct regions, while topology provides a finer-grained view with multiple topologies forming re…
Figure 9. Uncurated random samples on the external experimental structures, using an MiAE pretrained with a masking ratio of 90% on the Foldseek-clustered dataset. For each sample, we show the original structure, the masked structure, and the reconstructed structure for two masking ratios, 70% and 90%. Masked residues are marked in gray.
Original abstract

Classifying protein topology is essential for deciphering biological function, but progress is held back by the lack of large-scale benchmarks that avoid duplicates and by models that do not scale well. We introduce TEDBench, a large-scale, non-redundant benchmark for protein fold classification constructed from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. We show that on TEDBench, current protein representation learning methods either require very large models or fail to deliver strong performance. To address this challenge, we propose Masked Invariant Autoencoders (MiAE), a self-supervised framework for protein structure representation learning. MiAE uses an extremely high masking ratio of up to 90% with an $\mathrm{SE(3)}$-invariant encoder and a lightweight decoder that reconstructs backbone coordinates from the latent representation and mask tokens. MiAE scales well and outperforms supervised counterparts and state-of-the-art baselines on TEDBench, establishing a strong recipe for protein fold classification. To test transfer beyond AlphaFold structures, we further benchmark on a curated dataset from experimental structures of CATH v4.4. TEDBench is available at https://github.com/BorgwardtLab/TEDBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TEDBench, a large-scale non-redundant benchmark for protein fold classification built from the Encyclopedia of Domains (TED) combined with Foldseek-clustered AlphaFold structures. It proposes Masked Invariant Autoencoders (MiAE), a self-supervised framework using up to 90% masking, an SE(3)-invariant encoder, and a lightweight decoder that reconstructs backbone coordinates. The central claim is that MiAE scales well and outperforms both supervised baselines and prior state-of-the-art methods on TEDBench while transferring to a curated experimental CATH v4.4 dataset; TEDBench is released publicly.

Significance. If the results hold after validation of the benchmark, the work would be significant for structural bioinformatics and representation learning by supplying a scalable self-supervised recipe for fold classification and a new public benchmark that addresses duplication issues in existing resources. The high-masking invariant autoencoder approach and the github release of TEDBench constitute concrete contributions to reproducibility.

major comments (2)
  1. [TEDBench construction] TEDBench construction section: the description of combining TED domains with Foldseek-clustered AlphaFold structures supplies no quantitative details on clustering thresholds, maximum inter-set TM-scores, sequence-identity cutoffs, or explicit checks that train/test folds are structurally dissimilar. Without these metrics it is impossible to confirm that reported gains reflect learned fold representations rather than residual similarities or AlphaFold geometric biases; this directly undermines the central outperformance claim.
  2. [Results] Results and experimental sections: the claim that MiAE 'outperforms supervised counterparts and state-of-the-art baselines' on TEDBench and transfers to CATH requires tabulated metrics with error bars, ablation studies on the masking ratio, and precise baseline implementations. The abstract states clear superiority but the manuscript must supply these numbers to substantiate scaling behavior and rule out dataset-specific artifacts.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to 90%' masking ratio should be accompanied by the exact ratio(s) used in the reported experiments for clarity.
  2. [Methods] Notation: the SE(3)-invariant encoder is introduced without an explicit equation or reference to the invariance property; adding a short formal definition would improve readability.
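
The kind of split-level statistic requested in major comment 1 can be sketched as below. The sketch assumes a precomputed matrix of structural similarity scores between every test and training structure, produced by an external alignment tool; computing those scores is outside its scope. The function name `split_similarity_report` and the default threshold of 0.5 are illustrative assumptions, not values reported by the paper.

```python
import numpy as np

def split_similarity_report(sim, threshold=0.5):
    """Summarize a (num_test, num_train) matrix of structural similarity scores."""
    sim = np.asarray(sim, dtype=float)
    # For each test structure, the score of its closest training structure.
    nearest = sim.max(axis=1)
    return {
        "max_inter_split": float(nearest.max()),
        "fraction_above_threshold": float((nearest > threshold).mean()),
    }

# Hypothetical scores for 4 test structures against 2 training structures.
scores = [[0.2, 0.3], [0.7, 0.1], [0.4, 0.45], [0.1, 0.2]]
report = split_similarity_report(scores)
```

Reporting both the maximum and the fraction above a cutoff separates a single outlier pair from systematic overlap between splits.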

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We have carefully considered the two major comments and provide point-by-point responses below. We agree that both points identify areas where the manuscript can be strengthened and plan to make the corresponding revisions.

Point-by-point responses
  1. Referee: [TEDBench construction] TEDBench construction section: the description of combining TED domains with Foldseek-clustered AlphaFold structures supplies no quantitative details on clustering thresholds, maximum inter-set TM-scores, sequence-identity cutoffs, or explicit checks that train/test folds are structurally dissimilar. Without these metrics it is impossible to confirm that reported gains reflect learned fold representations rather than residual similarities or AlphaFold geometric biases; this directly undermines the central outperformance claim.

    Authors: We agree that additional quantitative details on benchmark construction are required for full reproducibility and to allow independent verification that train/test splits are structurally dissimilar. In the revised manuscript we will expand the TEDBench construction section to report the specific Foldseek clustering parameters, maximum inter-set TM-scores, sequence-identity cutoffs, and the results of explicit structural dissimilarity checks (e.g., TM-align) between folds in different splits. These metrics will be presented in the main text together with a supplementary table summarizing the final dataset statistics and split properties. revision: yes

  2. Referee: [Results] Results and experimental sections: the claim that MiAE 'outperforms supervised counterparts and state-of-the-art baselines' on TEDBench and transfers to CATH requires tabulated metrics with error bars, ablation studies on the masking ratio, and precise baseline implementations. The abstract states clear superiority but the manuscript must supply these numbers to substantiate scaling behavior and rule out dataset-specific artifacts.

    Authors: We acknowledge that the experimental reporting can be made more rigorous. In the revised manuscript we will augment the results section with complete tabulated performance metrics (including accuracy, F1, and other relevant measures) accompanied by error bars from multiple independent runs, dedicated ablation experiments varying the masking ratio, and precise descriptions of all baseline implementations (model architectures, hyperparameters, and training protocols). These additions will directly support the scaling claims and outperformance statements. revision: yes
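
As a concrete reading of "F1 with error bars from multiple independent runs", the following sketch computes macro-averaged F1 for one run and a mean and sample standard deviation across runs. It is a generic NumPy illustration; the convention of scoring an empty class as 0 and the per-run values are assumptions, not the paper's evaluation code or results.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted members scores 0.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def mean_and_std(values):
    """Mean and sample standard deviation (ddof=1) across independent runs."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))

# Hypothetical per-run scores, for illustration only.
mean, std = mean_and_std([0.70, 0.72, 0.74])
```

Macro averaging gives every fold class equal weight, which matters when many classes have few members; a micro-averaged score would be dominated by the largest classes.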

Circularity Check

0 steps flagged

New benchmark construction and self-supervised pretraining exhibit no self-referential reduction or load-bearing self-citation.

Full rationale

The paper constructs TEDBench from TED domains plus Foldseek-clustered AlphaFold structures and then evaluates MiAE (high-masking SE(3)-invariant autoencoder) on it, reporting empirical outperformance over supervised baselines and prior methods. No equations, fitted parameters renamed as predictions, or self-citation chains are visible that would make the reported gains equivalent to the inputs by construction. The CATH experimental benchmark is invoked as an external transfer test. This is the common honest case of a self-contained empirical contribution; the reader's assigned score of 2 reflects only the normal novelty of a new benchmark rather than any circular step.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim depends on the validity of the deduplicated benchmark and the effectiveness of high-ratio masking for learning fold-discriminative representations; standard SE(3) invariance and standard autoencoder reconstruction objectives are assumed without new justification.

free parameters (1)
  • masking ratio
    Up to 90% masking ratio chosen as a hyperparameter to create a challenging reconstruction task.
axioms (1)
  • domain assumption: a protein's fold label is unchanged by SE(3) transformations (rotations and translations) of its backbone coordinates
    Invoked to justify the choice of invariant encoder architecture.
invented entities (1)
  • MiAE (no independent evidence)
    purpose: Self-supervised pretraining framework for protein structures
    Newly proposed method combining high masking with invariant encoder and lightweight decoder.
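
The SE(3) assumption in the ledger can be made concrete with a small numerical check. The sketch below uses pairwise distances as one simple example of an SE(3)-invariant feature and verifies that they are unchanged by a random rotation and translation. It is not the paper's encoder; the random coordinates are a stand-in for C-alpha positions, and the helper names are hypothetical.

```python
import numpy as np

def random_rigid_transform(rng):
    """Sample a proper rotation matrix R (det = +1) and a translation t."""
    # QR of a Gaussian matrix yields an orthogonal matrix; flip one column
    # if needed so the result is a rotation rather than a reflection.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    t = rng.normal(size=3)
    return q, t

def pairwise_distances(coords):
    """All-pairs Euclidean distances for an (N, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3))   # stand-in for C-alpha coordinates
R, t = random_rigid_transform(rng)
moved = coords @ R.T + t            # apply x -> R x + t to every residue

# Distances are identical (up to floating-point error) before and after.
assert np.allclose(pairwise_distances(coords), pairwise_distances(moved))
```

An encoder built only on such invariant quantities produces the same representation for any rigid placement of the same structure, so no rotation augmentation is needed during training.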



Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

  1. Highly accurate protein structure prediction with AlphaFold. Nature, 2021.
  2. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 2024.
  3. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. In CVPR, 2009.
  4. The shape and structure of proteins. In Molecular Biology of the Cell, 4th edition.
  5. Protein domain identification methods and online resources. Computational and Structural Biotechnology Journal, 2021.
  6. Christine A. Orengo, Alex D. Michie, Susan Jones, David T. Jones, Mark B. Swindells, and Janet M. Thornton. 1997.
  7. Vaishali P. Waman, Nicola Bordin, Andy Lau, Shaun Kandathil, Jude Wells, David Miller, Sameer Velankar, David T. Jones, Ian Sillitoe, and Christine Orengo. 2025.
  8. Oliver C. Redfern, Andrew Harrison, Tim Dallman, Frances M. G. Pearl, and Christine A. Orengo. 2007.
  9. Vamsi Nallapareddy, Nicola Bordin, Ian Sillitoe, Michael Heinzinger, Maria Littmann, Vaishali P. Waman, Neeladri Sen, Burkhard Rost, and Christine Orengo. 2023.
  10. Natalie L. Dawson, Tony E. Lewis, Sayoni Das, Jonathan G. Lees, David Lee, Paul Ashford, Christine A. Orengo, and Ian Sillitoe. 2017.
  11. Gergely Csaba, Fabian Birzele, and Ralf Zimmer. Systematic comparison of … 2009.
  12. Exploring structural diversity across the protein universe with The Encyclopedia of Domains. Science, 2024.
  13. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, 2024.
  14. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nature Communications, 2023.
  15. Chainsaw: protein domain segmentation with fully convolutional neural networks. Bioinformatics, 2024.
  16. A unified approach to protein domain parsing with inter-residue distance matrix. Bioinformatics, 2023.
  17. Jie Hou, Badri Adhikari, and Jianlin Cheng. 2018.
  18. Clustering predicted structures at the scale of the known protein universe. Nature, 2023.
  19. Masked autoencoders are scalable vision learners.
  20. Simulating 500 million years of evolution with a language model. Science, 2025.
  21. Generative models for graph-based protein design.
  22. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
  23. L. Holm and C. Sander. J. Mol. Biol., September 1993. PMID 8377180.
  24. Oliver C. Redfern, Andrew Harrison, Tim Dallman, Frances M. G. Pearl, and Christine A. Orengo. PLoS Comput. Biol., November 2007. PMID 18052539.
  25. I. N. Shindyalov and P. E. Bourne. Protein Eng., September 1998. PMID 9796821.
  26. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Research, 2024. doi:10.1093/nar/gkae1010.
  27. Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. Nucleic Acids Research, 2000. doi:10.1093/nar/28.1.235.
  28. Michael M. Bronstein, Joan Bruna, Taco Cohen, and Veli… Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv:2104.13478.
  29. Mario Geiger, Tess Smidt, Alby M., Benjamin Kurt Miller, Wouter Boomsma, Bradley Dice, Kostiantyn Lapchevskyi, Maurice Weiler, Michał Tyszkiewicz, Simon Batzner, Dylan Madisetti, Martin Uhrin, Jes Frellsen, Nuri Jung, Sophia Sanborn, Mingjian Wen, Josh Rackers, Marcel Rød, and Michael Bailey. doi…
  30. Ilyes Batatia, David Peter Kovacs, Gregor N. C. Simm, Christoph Ortner, and Gabor Csanyi. 2022.
  31. Sarp Aykent and Tian Xia. 2025.
  32. J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer, F. Chan, B. Koepnick, H. Nguyen, A. Kang, B. Sankaran, A. K. Bera, N. P. King, and D. Baker. Science, …
  33. Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017.
  34. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The Graph Neural Network Model.
  35. Kevin K. Yang, Niccolò Zanichelli, and Hugh Yeh. Protein Engineering, Design and Selection, 2022. doi:10.1093/protein/gzad015.
  36. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need.
  37. SaProt: Protein Language Modeling with Structure-aware Vocabulary. 2024.
  38. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023.
  39. BERT: Pre-training of deep bidirectional transformers for language understanding.
  40. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 2021.
  41. Endowing protein language models with structural knowledge. Bioinformatics, 2025.
  42. Protein Representation Learning by Geometric Structure Pretraining.
  43. Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. ACM Trans. Math. Softw., December 1997. doi:10.1145/279232.279236.
  44. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv preprint arXiv:2003.10555.
  45. Laurens van der Maaten and Geoffrey Hinton. Journal of Machine Learning Research.
  46. Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. 2024.
  47. Decoupled Weight Decay Regularization. International Conference on Learning Representations.
  48. Diederik P. Kingma and Jimmy Ba. Adam: … 3rd International Conference on Learning Representations, 2015.
  49. Tertiary alphabet for the observable protein structural universe. Proceedings of the National Academy of Sciences, 2016.
  50. Proteinshake: Building datasets and benchmarks for deep learning on protein structures.
  51. Evaluating protein transfer learning with TAPE.