pith. machine review for the scientific record.

arxiv: 2605.04118 · v1 · submitted 2026-05-05 · 🧬 q-bio.QM · cs.AI

Recognition: 3 theorem links


ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation

Chengyue Gong, Cong Liu, Jiaqi Guan, Jinyuan Sun, Milong Ren, Wenzhi Xiao, Xinshi Chen

Pith reviewed 2026-05-08 18:09 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.AI
keywords protein binder design · benchmark · evaluation protocol · structure prediction · de novo design · wet-lab validation · throughput metrics · structural diversity

The pith

A new benchmark for protein binder design shows that common structure prediction models disagree substantially on which designs succeed under identical rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a standardized framework called ProtDBench for evaluating computational methods that design proteins to bind specific targets. It applies this framework to a dataset of designs already validated through wet-lab experiments to test how reliably structure prediction models can judge success. The analysis finds that different models often reach conflicting conclusions about the same designs and that choices in filtering and success rules produce different rankings of design methods. The benchmark adds measures of how many designs can be generated within a fixed time limit and how structurally diverse they are. These elements together support more consistent comparisons across design approaches.
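
To make the moving parts concrete, here is a minimal sketch of the evaluation flow this summary describes: generate designs under a fixed time budget, score them with one or more structure-prediction verifiers, filter, and report. All function names and signatures here are hypothetical placeholders, not ProtDBench's actual API.

```python
# Hypothetical sketch of the standardized evaluation loop; the names and
# signatures are placeholders, not ProtDBench's actual API.
from typing import Callable

def evaluate_method(
    generate: Callable[[float], list],       # designs produced in N hours
    verifiers: dict[str, Callable],          # e.g. AF2-IG, Boltz-2 scorers
    passes_filter: Callable[[float], bool],  # the fixed filtering rule
    budget_hours: float = 24.0,
) -> dict:
    designs = generate(budget_hours)         # throughput-aware: fixed budget
    report = {}
    for name, score in verifiers.items():    # same designs, every verifier
        kept = [d for d in designs if passes_filter(score(d))]
        report[name] = {
            "n_generated": len(designs),
            "n_passed": len(kept),
            "success_rate": len(kept) / len(designs) if designs else 0.0,
        }
    return report                            # compare labels across verifiers
```

Running the same design set through several verifiers, as this loop does, is what makes the verifier-disagreement analysis below possible.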

Core claim

ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria for protein binder design. Using a large wet-lab annotated dataset, an analysis of structure prediction models as verifiers reveals substantial verifier-dependent bias and limited agreement under identical filtering protocols. Benchmarking representative generative methods across ten targets under a fixed protocol exposes systematic differences induced by filtering rules and success definitions, and throughput-aware evaluation surfaces trade-offs among computational efficiency, success rate, and structural diversity.
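
One way to make "limited agreement" measurable is chance-corrected agreement between the pass/fail labels two verifiers assign to the same designs. The sketch below is a minimal illustration: the verifier names follow the figures, the label vectors are invented, and the paper may report a different agreement statistic.

```python
# Chance-corrected agreement (Cohen's kappa) between two verifiers'
# pass/fail labels. The label vectors are invented for illustration
# and are not data from the paper.
from itertools import combinations

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # marginal pass rates
    pe = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

labels = {   # 1 = design passes this verifier's filter, 0 = fails
    "AF2-IG":  [1, 1, 0, 1, 0, 0, 1, 0],
    "Boltz-2": [1, 0, 0, 1, 1, 0, 0, 0],
    "Chai-1":  [0, 1, 0, 1, 0, 1, 1, 0],
}
for (va, la), (vb, lb) in combinations(labels.items(), 2):
    print(f"{va} vs {vb}: kappa = {cohens_kappa(la, lb):.2f}")
```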

What carries the argument

The ProtDBench evaluation framework, which specifies unified tasks, protocols, success criteria, 24-hour throughput metrics, and cluster-level criteria for structural diversity.
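
As a concrete reading of the 24-hour throughput metric, one plausible accounting is: how many designs fit in the budget, and how many of those pass the filter. This is a hedged sketch; the constant cost-per-design model is a simplifying assumption, and the paper's exact accounting may differ.

```python
# One plausible accounting for a fixed wall-clock budget. Assumes a
# constant per-design cost, which is a simplification.
def throughput_aware(per_design_seconds: float, passes: list,
                     budget_hours: float = 24.0) -> dict:
    n_in_budget = int(budget_hours * 3600 // per_design_seconds)
    n = min(n_in_budget, len(passes))
    hits = sum(passes[:n])
    return {
        "designs_in_budget": n_in_budget,
        "successes_in_budget": hits,
        "success_rate": hits / n if n else 0.0,
    }

# Hypothetical contrast: a slow, accurate method vs. a fast, noisier one.
print(throughput_aware(900.0, [True] * 40 + [False] * 56))     # 96 designs fit
print(throughput_aware(60.0,  [True] * 150 + [False] * 1290))  # 1440 designs fit
```

The contrast in the usage lines is the point of the metric: a method with a lower per-design success rate can still yield more successes inside the budget.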

If this is right

  • The same set of designed sequences receives different success labels depending on which structure prediction model acts as verifier.
  • The measured performance of any given design method shifts when filtering protocols or success definitions change.
  • Throughput-aware metrics under a fixed time budget expose trade-offs between the number of designs produced and their success rates.
  • Cluster-level success criteria add a requirement for structural diversity beyond per-sequence success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Design methods might be optimized to satisfy the preferences of the benchmark's specific verifiers rather than to improve actual binding behavior.
  • Expanding the set of targets or incorporating newer structure predictors could change the observed levels of verifier disagreement.
  • Reporting results under multiple verifier settings would help show whether a design method is robust to evaluation choices.

Load-bearing premise

That the wet-lab annotated dataset together with the chosen success criteria and filtering protocols accurately reflect real experimental binder performance.

What would settle it

Applying the same analysis to an independent collection of wet-lab validated binders and finding that the structure prediction models agree more closely or produce different method rankings than those reported.

Figures

Figures reproduced from arXiv: 2605.04118 by Chengyue Gong, Cong Liu, Jiaqi Guan, Jinyuan Sun, Milong Ren, Wenzhi Xiao, Xinshi Chen.

Figure 1
Figure 1: Benchmarking structure prediction models as filters on the Cao dataset. (a) Top-1% success rates achieved by individual confidence metrics (e.g., ipTM, pTM, pLDDT, ipAE) derived from AF2-IG, Boltz-1, Boltz-2, Chai-1, ColabFold, Protenix, and Protenix-Mini. (b) Success rates of combined filtering strategies across eight targets. “AF3 (published)” denotes baseline results from prior work, while other bars co…
Figure 2
Figure 2: Binder design benchmarks: (a) Success rates under AF2-IG-Easy; (b) Structural consistency across targets (enlarged view across columns for better granularity), defined as the fraction of sequences whose predicted structures recapitulate the generated backbone: $\mathrm{CR} = \frac{1}{\sum_i |S_i|} \sum_i \sum_{s \in S_i} C(s)$ (Eq. 8).
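
A direct transcription of Eq. (8) into code, assuming the per-sequence consistency indicator C(s) has already been computed (e.g., from a self-consistency TM-score or RMSD threshold, which the excerpt does not specify):

```python
def consistency_ratio(per_target_flags: list) -> float:
    """Eq. (8): CR = (1 / sum_i |S_i|) * sum_i sum_{s in S_i} C(s).

    per_target_flags[i][j] is 1 if sequence j of target i is judged
    self-consistent (its predicted structure recapitulates the generated
    backbone), else 0.
    """
    total = sum(len(flags) for flags in per_target_flags)
    return sum(map(sum, per_target_flags)) / total if total else 0.0

# Hypothetical example: two targets with 3 and 2 scored sequences.
print(consistency_ratio([[1, 0, 1], [1, 1]]))  # 0.8
```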
Figure 3
Figure 3: Binder structural diversity benchmark. For each target, bars show the diversity-adjusted cluster pass rate of each model after AF2-IG-Easy filtering. Clustering is performed only on backbones that pass the filter, and the reported percentage corresponds to the number of unique structural clusters at TM-score thresholds 0.6, 0.8, and 1.0, normalized by the total number of designed backbones. Different color…
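
The caption does not say which clustering algorithm is used; a simple greedy leader-clustering sketch under a TM-score threshold, with a hypothetical `tm_score` callable (e.g., wrapping TM-align), illustrates how the diversity-adjusted rate could be computed:

```python
# Greedy leader clustering under a TM-score threshold. `tm_score` is a
# hypothetical callable; the paper's actual clustering procedure may differ.
from typing import Callable, Sequence

def n_clusters(passed: Sequence, tm_score: Callable, threshold: float) -> int:
    reps = []                                 # one representative per cluster
    for backbone in passed:
        if not any(tm_score(backbone, r) >= threshold for r in reps):
            reps.append(backbone)             # founds a new structural cluster
    return len(reps)

def cluster_pass_rate(passed: Sequence, n_generated: int,
                      tm_score: Callable, threshold: float = 0.6) -> float:
    # Unique clusters among filter-passing designs, normalized by the total
    # number of generated backbones, as in the Figure 3 caption.
    if n_generated == 0:
        return 0.0
    return n_clusters(passed, tm_score, threshold) / n_generated
```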
Figure 4
Figure 4: Protenix-Mini sequence-level success rates. Fraction of generated binder sequences passing the Protenix-Mini filter for each target. Absolute success rates are lower than under AF2-IG-Easy due to the conservative confidence scores of Protenix-Mini.
Figure 5
Figure 5: Protenix-Mini cluster pass rates. Diversity-adjusted success rate computed as the number of unique structural clusters among passed designs (clustered at TM-score thresholds 0.6/0.8/1.0) divided by the total number of generated backbones. This metric captures both filter success and diversity among successful designs.
Figure 6
Figure 6: Performance of binders with different secondary structures designed by various methods on AlphaFold2- and Protenix-based metrics.
Figure 7
Figure 7: Alpha-helix ratio on different targets across various methods.
Figure 8
Figure 8: Reference ratio of gyration radius on different targets across various methods.
Figure 9
Figure 9: AlphaFold2 initial guess interface pAE on different targets across various methods.
Figure 10
Figure 10: AlphaFold2 initial guess pLDDT on different targets across various methods.
Figure 11
Figure 11: Bound-unbound RMSD on different targets across various methods.
Figure 12
Figure 12: AlphaFold2 initial guess pTM on different targets across various methods. Since BindCraft integrates hallucination and evaluation in a single pipeline, we removed evaluation time from our measurement to enable a fairer comparison. Specifically, we measured the time consumed by the binder hallucination function (https://github.com/martinpacesa/BindCraft/blob/main/bindcraft.py, Lines 109-111). Within this…
Figure 13
Figure 13: The performance of the AF2 confidence score filters. The SR for each confidence combination is plotted as one gray dot. The Pareto-frontier filters are highlighted as red stars, and the selected one is marked as a black star. For each threshold combination, we compute the success rate (SR) on each target. Following the definition of the Pareto frontier, the search algorithm can arrive at a set of optimal points…
Figure 14
Figure 14: AUC and Average Precision scores for individual confidence metrics on subsampled Cao data. (a) Higher values indicate better global discrimination between binders and non-binders. (b) Similar to AUC, but more sensitive to top-ranking false positives. We report AUC and average precision scores for individual confidence metrics across diverse design targets.
read the original abstract

Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProtDBench, a standardized benchmark framework for de novo protein binder design that defines unified tasks, evaluation protocols, success criteria, and throughput-aware metrics. Using a large wet-lab annotated dataset, it demonstrates substantial verifier-dependent bias and limited agreement among structure prediction models under identical filters, then benchmarks representative generative methods across ten targets while incorporating cluster-level diversity criteria.

Significance. If the wet-lab dataset and protocols hold as reliable proxies, ProtDBench would fill a critical gap by enabling reproducible, controlled comparisons of binder design methods and exposing how filtering rules and verifier choice systematically alter reported performance, success rates, and efficiency rankings. The throughput budget and diversity metrics are particularly valuable additions for practical assessment.

major comments (2)
  1. [Abstract and dataset description] The central claim that ProtDBench supplies a fair pipeline under realistic settings rests on the wet-lab annotated dataset serving as ground truth; however, the manuscript provides no external validation, correlation analysis with independent affinity measurements, or sensitivity checks for target selection and annotation quality biases (see abstract and the dataset description section).
  2. [Evaluation protocols and results sections] On success criteria and filtering protocols: the definitions of success (structure-based or otherwise) and the 24-hour throughput budget are presented without any reported correlation to actual experimental binding outcomes, which directly affects the validity of the observed verifier biases and method rankings.
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from explicit citation of prior non-standardized evaluation practices in the field to better motivate the need for unification.
  2. [Figures and tables] Figure legends and table captions should clarify the exact number of sequences per target and the precise definition of 'cluster-level success' to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and outline revisions to improve clarity on dataset reliability and the grounding of success criteria.

read point-by-point responses
  1. Referee: [Abstract and dataset description] The central claim that ProtDBench supplies a fair pipeline under realistic settings rests on the wet-lab annotated dataset serving as ground truth; however, the manuscript provides no external validation, correlation analysis with independent affinity measurements, or sensitivity checks for target selection and annotation quality biases (see abstract and the dataset description section).

    Authors: We agree that the manuscript's central claims depend on the wet-lab dataset serving as a reliable proxy for ground truth. The dataset aggregates binders from published experimental studies with wet-lab validation (e.g., via SPR, ELISA, or functional assays). However, the current version does not include new external validation, cross-correlations with independent affinity data, or formal sensitivity checks. We will revise the dataset description section to expand on data provenance, annotation quality controls, known limitations, and a sensitivity analysis for target selection and annotation biases. This will better substantiate the fairness of the evaluation pipeline. revision: partial

  2. Referee: [Evaluation protocols and results sections] § on success criteria and filtering protocols: the definitions of success (structure-based or otherwise) and the 24-hour throughput budget are presented without reported correlation to actual experimental binding outcomes, which directly affects the validity of the observed verifier biases and method rankings.

    Authors: The success criteria and 24-hour throughput budget are defined using computational proxies (structure prediction agreement, sequence and cluster diversity) with the wet-lab annotations serving as the reference for positive binders. We acknowledge that the manuscript does not report explicit quantitative correlations between these in silico definitions and independent experimental binding outcomes, which could influence interpretation of the verifier biases and method rankings. We will revise the evaluation protocols section to explicitly discuss the proxy nature of the metrics, reference any available supporting literature on their correlation with experiments, and add a limitations paragraph. This will strengthen the presentation without altering the core results. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark definitions and analyses are independent of self-referential inputs

full rationale

The paper introduces ProtDBench by defining new unified benchmark tasks, evaluation protocols, success criteria, and throughput-aware metrics applied to a wet-lab annotated dataset. It then uses these definitions to analyze verifier biases in structure prediction models and to benchmark generative design methods across targets. No load-bearing claims reduce by construction to fitted parameters, self-citations, or renamed inputs; the central results on bias, agreement, and method comparisons follow directly from applying the externally grounded dataset and fixed protocols. This is a methods/benchmark paper with no derivation chain that collapses to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about the validity of in silico verifiers and the representativeness of the wet-lab dataset for defining success, without introducing new fitted parameters or invented entities.

axioms (2)
  • domain assumption Structure prediction models serve as appropriate proxies for experimental validation of binder designs.
    The paper uses them as evaluation verifiers and analyzes their bias.
  • domain assumption The wet-lab annotated dataset provides reliable ground truth for benchmarking.
    Used for verifier analysis and method benchmarking across targets.

pith-pipeline@v0.9.0 · 5518 in / 1347 out tokens · 56914 ms · 2026-05-08T18:09:03.659188+00:00 · methodology


