ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation
Pith reviewed 2026-05-25 06:45 UTC · model grok-4.3
The pith
ProtDBench standardizes binder design evaluation and shows that structure prediction verifiers disagree on the same designs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria for protein binder design. Using wet-lab annotated data, it reveals substantial verifier-dependent bias and limited agreement among structure prediction models under identical filtering. Benchmarking representative generative methods across ten targets under a fixed protocol that includes throughput-aware metrics and cluster-level criteria exposes systematic differences induced by filtering rules, success definitions, and evaluation settings.
What carries the argument
ProtDBench, the evaluation framework that applies a fixed 24-hour budget, cluster-level success criteria, and consistent verifier protocols across methods and targets.
If this is right
- The same set of designed sequences receives different success labels depending on which structure prediction model is used as verifier.
- Throughput-aware scoring under a fixed time budget produces different efficiency rankings than per-sequence success rates alone.
- Cluster-level criteria that penalize redundant structures change which methods appear to succeed compared with sequence-level counts.
- A shared protocol makes it possible to attribute performance gaps to method differences rather than evaluation choices.
Where Pith is reading between the lines
- Adopting one shared benchmark could reduce the number of designs that later fail experimental validation.
- Methods that balance high success under multiple verifiers with good throughput may prove more practical than those optimized for a single metric.
- The observed verifier disagreement suggests that combining several structure predictors could give more stable estimates of design quality.
- The framework could be extended to additional targets or to closed-source methods to test whether current rankings hold.
Load-bearing premise
The wet-lab annotations are treated as accurate ground truth for measuring verifier bias and method performance on the chosen targets.
What would settle it
A new independent wet-lab screen on the same ten targets that assigns different binding labels to the evaluated sequences would alter both the measured verifier disagreements and the method rankings.
Figures
read the original abstract
Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProtDBench, a standardized evaluation framework for protein binder design that defines unified benchmark tasks, protocols, success criteria, and throughput-aware metrics (including a fixed 24-hour budget and cluster-level diversity). It uses a large wet-lab annotated dataset to analyze structure prediction models as verifiers, reporting substantial verifier-dependent bias and limited agreement under identical filtering, then benchmarks representative open-source generative methods across ten diverse targets under a fixed protocol.
Significance. If the wet-lab annotations prove reliable and representative, ProtDBench would provide a needed reproducible pipeline for fair comparison of binder design methods, exposing how filtering rules and success definitions affect rankings of computational efficiency, success rate, and structural diversity. The empirical focus on external wet-lab data and inclusion of throughput metrics are strengths for practical utility.
major comments (2)
- [Abstract / verifier analysis section] Abstract and the section on verifier analysis: the central claim of 'substantial verifier-dependent bias and limited agreement' is stated without any quantitative details on dataset size, exact filtering rules, agreement metrics (e.g., percentage overlap or statistical tests), or how the wet-lab annotations were obtained and validated. This directly undermines assessment of the bias finding.
- [Wet-lab dataset / benchmarking results section] Section describing the wet-lab dataset and its use for method benchmarking: the dataset is treated as ground truth both for quantifying verifier bias and for ranking generative methods on ten targets, yet no validation of annotation accuracy, no distributional comparison (sequence, fold, or binding regime) to the benchmark targets, and no discussion of potential systematic assay errors are provided. This is load-bearing for both the bias results and the reported method rankings.
minor comments (2)
- [Methods / evaluation protocol] Clarify the exact implementation of the 24-hour throughput budget and cluster-level success criteria, including any pseudocode or parameter values used.
- [Benchmark targets description] Add explicit statements on how the ten target proteins were selected and whether they overlap with or are representative of the wet-lab dataset proteins.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight opportunities to strengthen the clarity and transparency of our claims. We address each major point below and have made revisions to incorporate additional details and discussion.
read point-by-point responses
-
Referee: [Abstract / verifier analysis section] Abstract and the section on verifier analysis: the central claim of 'substantial verifier-dependent bias and limited agreement' is stated without any quantitative details on dataset size, exact filtering rules, agreement metrics (e.g., percentage overlap or statistical tests), or how the wet-lab annotations were obtained and validated. This directly undermines assessment of the bias finding.
Authors: We agree that the abstract and verifier analysis section would benefit from explicit quantitative support for the central claim. In the revised manuscript we have added the size of the wet-lab annotated dataset, the precise filtering rules applied, agreement metrics (including percentage overlap and appropriate statistical tests), and a description of how the annotations were sourced and validated from the original experimental studies. These additions directly address the concern and allow readers to better evaluate the reported verifier-dependent bias. revision: yes
-
Referee: [Wet-lab dataset / benchmarking results section] Section describing the wet-lab dataset and its use for method benchmarking: the dataset is treated as ground truth both for quantifying verifier bias and for ranking generative methods on ten targets, yet no validation of annotation accuracy, no distributional comparison (sequence, fold, or binding regime) to the benchmark targets, and no discussion of potential systematic assay errors are provided. This is load-bearing for both the bias results and the reported method rankings.
Authors: We acknowledge that treating the wet-lab annotations as ground truth requires supporting context. In the revision we have expanded the dataset description to include (i) a summary of annotation validation procedures reported in the source studies, (ii) distributional comparisons of sequence identity, fold classes, and binding regimes between the annotated dataset and the ten benchmark targets, and (iii) an explicit discussion of potential systematic assay errors together with the limitations these impose on interpretation. These additions provide the necessary caveats for both the bias analysis and the method rankings. revision: yes
Circularity Check
No circularity: empirical benchmark against external wet-lab data
full rationale
The paper introduces ProtDBench as a standardized evaluation framework and performs empirical comparisons of structure prediction models and generative methods using a large external wet-lab annotated dataset as ground truth. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are present in the provided text that reduce any claim to its own inputs by construction. The work is self-contained as a benchmark study; assumptions about the wet-lab data's reliability are correctness concerns, not circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Wet-lab annotated dataset serves as reliable ground truth for in silico verifier evaluation
- domain assumption Fixed 24-hour budget and cluster-level criteria are appropriate proxies for practical utility
invented entities (1)
-
ProtDBench benchmark framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
BioGeometry. Geoflow-v2: A unified atomic diffusion model for protein structure prediction and de novo de- sign.bioRxiv, pp. 2025–05,
work page 2025
-
[2]
Butcher, J. K. V ., Krishna, R., Mitra, R., Brent, R. I., Li, Y ., Corley, N., Kim, P., Funk, J., Mathis, S. V ., Salike, S., et al. De novo design of all-atom biomolecular interac- tions with rfdiffusion3.bioRxiv, pp. 2025–09,
work page 2025
-
[3]
Chai- 1: Decoding the molecular interactions of life.BioRxiv, pp
ChaiDiscovery, Boitreaud, J., Dent, J., McPartlon, M., Meier, J., Reis, V ., Rogozhonikov, A., and Wu, K. Chai- 1: Decoding the molecular interactions of life.BioRxiv, pp. 2024–10,
work page 2024
-
[4]
Zero-shot antibody design in a 24-well plate.bioRxiv, pp
ChaiDiscovery, Boitreaud, J., Dent, J., Geisz, D., McPart- lon, M., Meier, J., Qiao, Z., Rogozhnikov, A., Rollins, N., Wollenhaupt, P., et al. Zero-shot antibody design in a 24-well plate.bioRxiv, pp. 2025–07,
work page 2025
-
[5]
Cho, Y ., Pacesa, M., Zhang, Z., Correia, B
doi: 10.1101/2025.01.08.631967. Cho, Y ., Pacesa, M., Zhang, Z., Correia, B. E., and Ovchin- nikov, S. Boltzdesign1: Inverting all-atom structure pre- diction model for generalized biomolecular binder de- sign.bioRxiv, pp. 2025–04,
-
[6]
URLhttps://www.biorxiv.org/content/ early/2023/05/25/2023.05.24.542194
doi: 10.1101/2023.05.24.542194. URLhttps://www.biorxiv.org/content/ early/2023/05/25/2023.05.24.542194. Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A., Green, T., ˇZ´ıdek, A., Bates, R., Blackwell, S., Yim, J., et al. Protein complex prediction with alphafold- multimer.biorxiv, pp. 2021–10,
-
[7]
Gong, C., Chen, X., Zhang, Y ., Song, Y ., Zhou, H., and Xiao, W. Protenix-mini: Efficient structure predictor via compact architecture, few-step diffusion and switchable plm.arXiv preprint arXiv:2507.11839,
-
[8]
Qu, W., Ma, Y ., Ye, F., Lu, C., Zhou, Y ., Zhang, K., Wang, L., Gui, M., and Gu, Q
doi: 10.1101/2025.06.14.659707. Qu, W., Ma, Y ., Ye, F., Lu, C., Zhou, Y ., Zhang, K., Wang, L., Gui, M., and Gu, Q. Seedproteo: Accurate de novo all-atom design of protein binders,
-
[9]
URLhttps: //arxiv.org/abs/2512.24192. Stark, H., Faltings, F., Choi, M., Xie, Y ., Hur, E., O’Donnell, T., Bushuiev, A., Uc ¸ar, T., Passaro, S., Mao, W., Reveiz, M., Bushuiev, R., Pluskal, T., Sivic, J., Kreis, K., Vahdat, A., Ray, S., Goldstein, J. T., Savinov, A., Hambalek, J. A., Gupta, A., Taquiri-Diaz, D. A., Zhang, Y ., Hatstat, A. K., Arada, A., K...
-
[10]
URLhttps://www.biorxiv.org/content/ early/2025/11/24/2025.11.20.689494
doi: 10.1101/2025.11.20.689494. URLhttps://www.biorxiv.org/content/ early/2025/11/24/2025.11.20.689494. Team, P., Ren, M., Sun, J., Guan, J., Liu, C., Gong, C., Wang, Y ., Wang, L., Cai, Q., Chen, X., and Xiao, W. Pxdesign: Fast, modular, and accu- rate de novo design of protein binders.bioRxiv,
-
[11]
URL https://www.biorxiv.org/content/ early/2025/08/16/2025.08.15.670450
doi: 10.1101/2025.08.15.670450. URL https://www.biorxiv.org/content/ early/2025/08/16/2025.08.15.670450. Van Kempen, M., Kim, S. S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C. L., S ¨oding, J., and Steinegger, M. Fast and accurate protein structure search with foldseek. Nature biotechnology, 42(2):243–246,
-
[12]
Zambaldi, V ., La, D., Chu, A. E., Patani, H., Danson, A. E., Kwan, T. O., Frerix, T., Schneider, R. G., Saxton, D., Thillaisundaram, A., et al. De novo design of high- affinity protein binders with alphaproteo.arXiv preprint arXiv:2409.08022,
-
[13]
URLhttps: //arxiv.org/abs/2510.22304. 11 ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation Target PDB ID Crop Hotspot Natural binder Binder length BHRF12wh6A2–158 A65, A74, A77, A82, A85, A93 BH3 helix 80–120 SC2RBD6m0jE333–526 E485, E489, E494, E500, E505 ACE2 receptor 80–120 IL-7RA3di3B17–209 B58, B80, B139 IL-7 50–120 PD-L15o45A17...
-
[14]
For each target we report the total number of experimentally annotated non-binders (Neg.) and binders (Pos.), the number of non-binders retained after subsampling (Ret., capped at 20,000per crop for tractability in Figure 1a), and binder-sequence length statistics before/after subsampling. The EGFR row pools two crops, EGFRc and EGFRn, that share the same...
work page 2017
-
[15]
On this AF2-IG-pre-filtered AI-generated dataset, Protenix-Mini is more selective and retains fewer candidates, but achieves non-trivial precision on Mdm2 (0.623), PDL1 (0.389), IL7Ra (0.414), and InsulinReceptor (0.300), suggesting that our framework remains meaningfully correlated with wet-lab outcomes on AI-generated candidate distributions and not onl...
work page 2022
-
[16]
achieves higher pre- diction accuracy and is able to predict the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. Multiple open-source variants of AF3 are released later, including Boltz-1 (Wohlwend et al., 2024), Boltz-2 (Passaro et al., 2025), Chai-1 (ChaiDiscovery et al.,
work page 2024
-
[17]
and Protenix (Chen et al., 2025). These predictors are now integral to design and evaluation workflows. 23 ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation Table 10.Retrospective verifier evaluation on AF2-IG-prefiltered RFdiffusion designs.“Wet+/Wet−” are the wet-lab confirmed binder/non-binder counts per target. “TP” is the number...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.