pith. machine review for the scientific record.

arxiv: 2605.11764 · v1 · submitted 2026-05-12 · 💻 cs.LG · q-bio.BM

Recognition: 2 theorem links · Lean Theorem

Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:36 UTC · model grok-4.3

classification 💻 cs.LG q-bio.BM
keywords PROTAC activity prediction · generalization gap · inter-laboratory variance · leave-one-target-out · AUROC decomposition · machine learning benchmark · variance attribution · few-shot calibration

The pith

Inter-laboratory variance dominates the generalization gap in PROTAC activity prediction, capping leave-one-target-out AUROC near 0.67.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine-learning models for PROTAC biochemical activity show high performance under random splits but drop to around 0.67 AUROC under leave-one-target-out evaluation that mimics prediction for novel targets. The paper decomposes this gap by constructing a within-target cross-laboratory cascade from published measurements and finds that inter-laboratory variance accounts for the largest share, bounded at 0.124 AUROC, compared with only 0.05 from binarisation choices. Across eight architectures and large protein language models, performance plateaus at this level even after extensive hyperparameter search and deduplication. Modest gains come from few-shot per-target retraining with ADMET features and post-hoc calibration. The work releases a benchmark of 10,748 measurements across 173 targets to support further decomposition studies.
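The leave-one-target-out protocol at the heart of this decomposition can be sketched in a few lines. The snippet below is an illustrative stand-in on synthetic data, not the paper's pipeline: every fold holds out all compounds of one target, and the macro-mean of the per-fold AUROCs is the statistic that plateaus near 0.67.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Toy stand-in for a PROTAC table: 300 compounds, 6 targets, 32 features.
X = rng.normal(size=(300, 32))
targets = rng.integers(0, 6, size=300)     # target identity = group label
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Leave-one-target-out: each fold holds out ALL compounds of one target,
# so the model is always scored on a target it has never seen.
fold_aurocs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=targets):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    fold_aurocs.append(roc_auc_score(y[test_idx], scores))

macro_mean = float(np.mean(fold_aurocs))   # the paper's headline statistic
print(len(fold_aurocs), round(macro_mean, 3))
```

With real PROTAC data the `targets` array would carry protein identifiers and the macro-mean would be taken over the 65 eligible LOTO folds; everything above that is hypothetical scaffolding.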

Core claim

The generalization gap between random-split and leave-one-target-out performance arises mainly from inter-laboratory measurement variance rather than model architecture or split type. A within-target cross-laboratory cascade constructed from existing data bounds the inter-laboratory contribution at 0.124 AUROC, exceeding the 0.05 contribution from binarisation-threshold selection. LOTO AUROC plateaus near 0.67 across models and cannot be broken by 21-dimensional hyperparameter optimisation or SMILES deduplication; single-seed results regress by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction.
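The selection-bias regression is the easiest piece to reproduce from first principles. Below is a minimal simulation with invented numbers (a flat 0.67 plateau and an assumed per-seed AUROC noise of 0.05); the paper's own closed form follows Bailey and Lopez de Prado (2014), of which the sqrt(2 ln N) expression here is only the leading-order approximation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 2000 HPO trials, all sharing the same true LOTO
# AUROC of 0.67 (a flat plateau), observed with per-seed noise sd 0.05.
n_trials, seed_sd = 2000, 0.05
true_auroc = np.full(n_trials, 0.67)
single_seed = true_auroc + seed_sd * rng.normal(size=n_trials)

# The rank-1 configuration chosen on one seed looks better than it is...
winner = int(np.argmax(single_seed))
single_seed_score = single_seed[winner]

# ...and regresses toward its true value under fresh 10-seed evaluation.
multi_seed_score = np.mean(true_auroc[winner]
                           + seed_sd * rng.normal(size=10))
regression = single_seed_score - multi_seed_score

# Leading-order closed form: the expected max of N standard normals grows
# like sqrt(2 ln N), so selection bias scales as seed_sd * sqrt(2 ln N).
predicted = seed_sd * np.sqrt(2.0 * np.log(n_trials))
print(round(regression, 3), round(predicted, 3))
```

The point of the sketch is qualitative: whenever a winner is picked on a single noisy seed out of thousands of trials, a regression of roughly seed-noise times sqrt(2 ln N) is expected, not evidence of a genuinely better model.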

What carries the argument

The within-target cross-laboratory cascade that isolates and upper-bounds the inter-laboratory variance component of the observed generalization gap.

If this is right

  • LOTO AUROC plateaus near 0.67 across eight published architectures and ESM-2 models up to 3B parameters.
  • A 21-dimensional 2000-trial hyperparameter search and SMILES-level deduplication both fail to exceed the plateau.
  • Single-seed rank-1 configurations lose 0.161 AUROC under multi-seed evaluation, matching the closed-form selection-bias formula.
  • Few-shot k=5 stratified per-target retraining with ADMET features raises 65-target LOTO AUROC from 0.668 to 0.705.
  • Post-hoc Platt scaling brings raw model outputs inside the 0.05 well-calibrated threshold.
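Platt scaling itself is a one-dimensional logistic regression from raw scores to labels, fit on a held-out calibration fold. The sketch below uses a synthetic overconfident scorer; the miscalibration model and the ECE binning are illustrative assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Synthetic miscalibration: true probability p, reported as a raw score
# that over-commits at both extremes (a sharpened sigmoid of the logit).
p = rng.uniform(0.05, 0.95, size=5000)
y = (rng.uniform(size=5000) < p).astype(int)
logit = np.log(p / (1.0 - p))
raw = 1.0 / (1.0 + np.exp(-2.5 * logit))   # overconfident raw outputs

# Platt scaling: a 1-D logistic regression from score-logit to label,
# fit on a held-out calibration fold, applied to the test fold.
cal, test = slice(0, 2500), slice(2500, None)
feat = np.log(raw / (1.0 - raw)).reshape(-1, 1)
platt = LogisticRegression().fit(feat[cal], y[cal])
calibrated = platt.predict_proba(feat[test])[:, 1]

def ece(prob, label, bins=10):
    # Expected calibration error: bin-weighted |confidence - accuracy|.
    edges = np.linspace(0.0, 1.0, bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (prob >= lo) & (prob < hi)
        if m.any():
            err += m.mean() * abs(prob[m].mean() - label[m].mean())
    return err

ece_raw, ece_cal = ece(raw[test], y[test]), ece(calibrated, y[test])
print(round(ece_raw, 3), round(ece_cal, 3))
```

Because the synthetic miscalibration is itself logistic in the logit, Platt scaling can recover it almost exactly here; real pipelines rarely satisfy that assumption as cleanly.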

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardising assay protocols across labs or pooling multi-lab data for the same targets could raise the effective performance ceiling for novel-target prediction.
  • The observed selection bias under single-seed validation implies that future PROTAC and bioactivity papers should report multi-seed statistics by default.
  • The variance-decomposition approach could be applied to other target-specific activity prediction tasks where published data come from heterogeneous laboratory sources.

Load-bearing premise

The cascade built from published data cleanly separates inter-laboratory variance without leftover confounding from assay protocols or compound selection biases.

What would settle it

Measure the same set of PROTAC compounds on the same targets in multiple independent laboratories under controlled conditions, then recompute the AUROC gap between intra-lab and cross-lab splits to test whether it matches the 0.124 bound.
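What such a controlled replication would measure can be sketched with a toy generative model: one latent activity per compound, lab-specific measurement noise, per-lab binarisation. All parameters below (three labs, a noise scale of 0.8) are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Toy replication study: 400 compounds with one latent activity each,
# measured in 3 independent labs with lab noise, binarised per lab.
n, lab_sd = 400, 0.8
latent = rng.normal(size=n)
labs = latent[None, :] + lab_sd * rng.normal(size=(3, n))
labels = (labs > 0).astype(int)     # per-lab binarisation at zero

# Intra-lab: score a lab's measurements against its own labels.
intra = roc_auc_score(labels[0], labs[0])

# Cross-lab: score lab a's measurements against lab b's labels.
cross = float(np.mean([roc_auc_score(labels[b], labs[a])
                       for a in range(3) for b in range(3) if a != b]))

gap = intra - cross   # to be compared against the paper's 0.124 bound
print(round(intra, 3), round(cross, 3), round(gap, 3))
```

In a real study the intra-lab score would come from a model rather than the raw measurement, but the comparison is the same: the intra-minus-cross gap, recomputed under controlled conditions, is what would confirm or refute the 0.124 bound.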

Figures

Figures reproduced from arXiv: 2605.11764 by Ming Tang, Thor Klamt, Wolfgang Nejdl.

Figure 1: The held-out-target performance gap and train-test molecular overlap. (A) Reported versus replicated AUROC across four published PROTAC predictors (DeepPROTACs, Ribes et al., PROTAC-STAN, DegradeMaster) and two reference baselines (RF+Morgan, kNN). Per-target AUROCs are overlaid as scatter dots on the replicated bars. (B) Maximum Tanimoto similarity to nearest training-set neighbor, density across protocol…

Figure 2: Triangulating bounds on the random-CV-to-LOTO gap. The methodologically primary decomposition is the ω² = 0.256 variance-share analysis reported in the body text; bars are partially overlapping bounds rather than additive components (the four anchors sum to approximately the 0.18 to 0.20 macro-mean gap as a coarse upper-bound triangulation rather than as an additive identity). Same-compound-cross-target sensi…

Figure 3: Factorial decomposition and few-shot calibration. (a) Mean marginal AUROC contribution of each factor under target-clustered bootstrap (n=65 targets, 5000 replicates); few-shot k=5 at +0.0306 with CI [+0.015, +0.051], ADMET at +0.0111, warhead transfer at +0.0025 with CI [−0.009, +0.015] crossing zero. (b) Few-shot learning curves: RF retraining beats both meta-learning baselines (MAML, ProtoNet) at every…

Figure 4: HPO V2 per-trial AUROC across major hyperparameter dimensions. Each panel reports trial-level AUROC against one HP dimension (head type, molecular encoder, protein encoder, fragment mode, ternary attention, warhead transfer toggle). Seed 0 exploration phase (light blue), seed 0 exploitation phase (dark blue), seed 1 exploration phase (light orange), seed 1 exploitation phase (orange). The visual separation…

Figure 5: Functional ANOVA variance attribution across the 21-dimensional HPO V2 search space. Bars report the proportion of trial-level AUROC variance attributable to each hyperparameter dimension. The head_type dimension explains 28.1 percent of variance, molecular encoder 26.0 percent, protein encoder 14.6 percent, normalisation 10.3 percent, and fragment mode 9.3 percent, with rdkit_desc (5.1 percent) and all re…

Figure 6: Structure ladder under matched LOTO evaluation. AUROC across six methodologically distinct geometric and structural approaches plus pocket-shuffle and zero-pocket controls. Error bars where 10-seed standard deviation is available; Boltz-2 ternary, Morgan plus Boltz-2, and AlphaFold-pocket rows are reported as single-seed point estimates under the upstream pipeline configurations and therefore omit error ba…

Figure 7: Pocket-shuffle control on the EGNN hybrid configuration. Bar heights show 10-seed mean LOTO AUROC. Per-seed dots are overlaid; the dotted reference line marks the original-hybrid mean. Pocket geometry contribution is at most 0.013 AUROC, within seed standard deviation.

Figure 8: Synthetic-noise calibration of the inter-laboratory bound. LOTO macro-mean AUROC under uniform-random label flips at f ∈ {0, 0.01, 0.02, 0.05, 0.10, 0.15, 0.20}, 3 seeds per noise level (per-seed dots overlaid on the error bars). Linear regression through the 7 means has slope −0.0054 AUROC per percent flip and intercept 0.659. The shaded band shows the central 80 percent prediction interval, with band hal…

Figure 9: Reliability diagram for the Morgan baseline and full-stack pipeline under raw output. LOWESS-smoothed empirical positive rate plotted against mean predicted probability for the Morgan baseline (C0) and the full-stack Morgan plus warhead plus ADMET plus few-shot k=5 pipeline (C3), aggregated across 10-seed canonical LOTO evaluation (n = 94,280 predictions per condition). Per-segment linewidth encodes local…

Figure 10: Per-family LOFO AUROC across the 22-family cohort. Horizontal bars sorted by LOFO mean AUROC descending. Number of targets per family shown next to each family label. Vertical reference line at the canonical LOTO baseline of 0.668. Colour gradient from green (high AUROC) to orange (low AUROC). Under matched canonical RF settings (10 seeds, n_estimators=200, morgan_bits=2048, class_weight=balanced) the agg…

Figure 11: Local mlcroissant validator output for the populated Croissant metadata. The validator confirms schema PASS at exit code zero with all twenty MLCommons RAI extension fields populated; the only validator output is the cosmetic equivalentProperty warning shared with the upstream Croissant 1.0 schema. The validator report against the live HuggingFace URL is committed to the camera-ready release. Compute and…
Original abstract

Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade bounding the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication; a 21-dimensional 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot k=5 stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.7050, and post-hoc Platt scaling recovers raw output to within the 0.05 well-calibrated threshold. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript decomposes the random-split to leave-one-target-out (LOTO) generalization gap in PROTAC activity prediction models. It identifies inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade that bounds this contribution at 0.124 AUROC, exceeding the 0.05 from binarisation-threshold choice. The work demonstrates performance plateaus across multiple architectures including large ESM-2 models, consistency of hyperparameter optimization results with selection-bias theory, and modest gains from few-shot per-target retraining and Platt scaling. A new benchmark dataset PROTAC-Bench is released along with evaluation code.

Significance. If the decomposition is robust, the paper makes a meaningful contribution to understanding limits in de-novo PROTAC design by highlighting data variance across laboratories as the primary barrier to generalization, rather than model expressivity. The use of external published data for the bound, the match to closed-form selection bias predictions, and the open release of the benchmark and framework are strengths that support reproducibility and practical impact in the field.

major comments (1)
  1. [within-target cross-laboratory cascade construction] The 0.124 AUROC bound from the within-target cross-laboratory cascade is load-bearing for the central attribution of the gap to inter-laboratory variance. Because the cascade aggregates heterogeneous published measurements, any unmodeled differences in assay format (cell-based vs. biochemical), detection method, or compound selection criteria could introduce residual confounding and inflate the bound. The manuscript does not report stratification or regression on assay metadata, so the isolation of pure inter-lab variance is not fully established.
minor comments (1)
  1. The improved LOTO AUROC is reported as 0.7050 in the abstract; confirm the precise value and whether it reflects four-decimal precision or a typographical artifact.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of rigorously isolating inter-laboratory variance. We address the single major comment below and commit to revisions that strengthen the supporting analysis.

Point-by-point responses
  1. Referee: [within-target cross-laboratory cascade construction] The 0.124 AUROC bound from the within-target cross-laboratory cascade is load-bearing for the central attribution of the gap to inter-laboratory variance. Because the cascade aggregates heterogeneous published measurements, any unmodeled differences in assay format (cell-based vs. biochemical), detection method, or compound selection criteria could introduce residual confounding and inflate the bound. The manuscript does not report stratification or regression on assay metadata, so the isolation of pure inter-lab variance is not fully established.

    Authors: We agree that the absence of explicit stratification or regression on assay metadata leaves open the possibility of residual confounding. In the revised manuscript we will add a dedicated subsection that (i) stratifies the within-target cross-laboratory cascade by the assay metadata available in PROTAC-Bench (cell-based vs. biochemical, detection method, and compound selection criteria where recorded), (ii) reports the AUROC bound within each stratum, and (iii) fits a linear regression of pairwise AUROC differences on laboratory identity while controlling for assay type and detection method. Preliminary internal checks show that the inter-laboratory effect remains statistically significant and of comparable magnitude after these controls, but we will report the full coefficients, adjusted bound, and any sensitivity analyses. These additions will be placed immediately after the current cascade description so that readers can evaluate the robustness of the 0.124 AUROC figure directly. revision: yes
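The regression the authors commit to is, at its core, ordinary least squares of pairwise AUROC differences on a cross-lab indicator plus assay-metadata covariates. A schematic version on simulated records follows; the covariates, effect sizes, and sample size are invented for illustration, not taken from PROTAC-Bench.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented per-pair records: the AUROC difference for compound sets
# measured twice, with binary covariates for assay metadata.
n = 200
cross_lab = rng.integers(0, 2, size=n)    # 1 = measurements from two labs
assay_cell = rng.integers(0, 2, size=n)   # 1 = cell-based, 0 = biochemical
det_fluor = rng.integers(0, 2, size=n)    # 1 = fluorescence detection

# Simulated outcome: cross-lab pairs lose ~0.12 AUROC, assay type adds a
# smaller shift, plus noise.
diff = 0.12 * cross_lab + 0.03 * assay_cell + 0.02 * rng.normal(size=n)

# OLS with intercept: coef[1] is the inter-lab effect after controlling
# for assay type and detection method.
X = np.column_stack([np.ones(n), cross_lab, assay_cell, det_fluor])
coef, *_ = np.linalg.lstsq(X, diff, rcond=None)
inter_lab_effect = float(coef[1])
print(np.round(coef, 3))
```

If the adjusted cross-lab coefficient stayed near the unadjusted bound after adding the metadata columns, as the authors' preliminary checks suggest, the 0.124 AUROC figure would survive the referee's confounding objection.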

Circularity Check

0 steps flagged

No significant circularity; inter-lab bound anchored on external published measurements

full rationale

The central derivation decomposes the random-split to LOTO gap by attributing dominance to inter-laboratory variance, with the 0.124 AUROC bound supplied by a within-target cross-laboratory cascade built from heterogeneous published data rather than fitted to the present model's outputs or LOTO folds. This external sourcing prevents definitional reduction or fitted-input-called-prediction patterns. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the reported chain; the hyperparameter search, few-shot retraining lift, and selection-bias regression are presented as separate empirical observations. The minor score of 2 reflects only the normal presence of external citations without any load-bearing reduction to self-referential inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The analysis rests on the assumption that published cross-lab measurements form an unbiased sample of inter-laboratory variance and that the cascade construction isolates this component from target-specific effects.

free parameters (1)
  • binarisation threshold contribution
    The 0.05 AUROC contribution is treated as an empirical upper bound derived from threshold choice experiments.
axioms (1)
  • domain assumption Within-target cross-laboratory measurements provide a valid upper bound on inter-laboratory variance for the same compounds
    Invoked to anchor the 0.124 AUROC bound as the dominant gap component.

pith-pipeline@v0.9.0 · 5683 in / 1278 out tokens · 36317 ms · 2026-05-13T07:36:28.816615+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

  1. Gavin C Cawley and Nicola LC Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 11:2079–2107.
  2. Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
  3. Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
  4. Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434.
  5. Logan Hallee, Tamar Peleg, Nikolaos Rafailidis, and Jason P Gleghorn. Protein language models are accidental taxonomists. bioRxiv, pages 2025–10.
  6. Leon Hermann, Tobias Fiedler, Hoang An Nguyen, Melania Nowicka, and Jakub M Bartoszewicz. Beware of data leakage from protein llm pretraining. bioRxiv, pages 2024–07.
  7. Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, and Percy Liang. Selective classification can magnify disparities across groups. arXiv preprint arXiv:2010.14134.
  8. Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054.
  9. Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700.
  10. Stefano Ribes, Eva Nittinger, Christian Tyrchan, and Rocío Mercado. Modeling PROTAC degradation activity with machine learning. Artificial Intelligence in the Life Sciences, 6:100104.
  11. David R Roberts, Volker Bahn, Simone Ciuti, Mark S Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein, José J Lahoz-Monfort, Boris Schröder, Wilfried Thuiller, et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8):913–929.
  12. Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In European Conference on Computer Vision, pages 266–282. Springer.
  13. Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, et al. Datasheets aren't enough: Datarubrics for automated quality metrics and accountability. arXiv preprint arXiv:2506.01789.
  14. Shuo Yan, Yuliang Yan, Bin Ma, Chenao Li, Haochun Tang, Jiahua Lu, Minhua Lin, Yuyuan Feng, Hui Xiong, and Enyan Dai. Protap: A benchmark for protein modeling on realistic downstream applications. arXiv preprint arXiv:2506.02052.

Internal anchors (excerpts from the paper):

  15. ESM-2 models at five scales (8M, 35M, 150M, 650M, 3B parameters) were evaluated as protein encoders concatenated to Morgan 2048 fingerprints with a Random Forest head, under both random-split and LOTO evaluation, across 5 canonical seeds. Random-CV pooled AUROC inflates monotonically with PLM scale from 0.890 (8M) to 0.914 (3B), while LOTO macro-mean AUROC follows a no…
  16. [Excerpt from the Figure 4 panels: trial-level LOTO AUROC (15-target) plotted per hyperparameter dimension (head_type, mol_encoder, prot_encoder, fragment_mode, ternary attention), with seed 0/1 exploration and exploitation phases colour-coded.]
  17. …and a faithful replication on PROTAC-Bench at 0.626 was investigated through five controlled single-variable substitutions, with each component's individual contribution reported in Table 4; the components are not orthogonal and their individual effects do not constitute an additive decomposition. Dataset and class balance accounts for approximately 0.10…
  18. Temporal-prospective evaluation (training pre-2023, testing…
  19. Pairwise eta-squared decomposition. A two-way Type-II ANOVA on the within-target cross-lab cohort (36 targets across four binarization schemes) partitions AUROC variance with target as the 36-level… (†Binary degradation entries with measured DC50 or Dmax. ‡Includes patent-enumerated compounds without matched activity assays.)
  20. …shows severe under-prediction at the lowest-confidence bin (mean predicted 0.054, empirical 0.294, gap +0.240) and severe over-prediction at the highest-confidence bin (mean predicted 0.941, empirical 0.645, gap −0.296), consistent with the high-confidence overconfidence pattern documented by Jones et al. [2020]. Post-hoc temperature scaling reduces ECE…
  21. …finding that post-hoc calibration fails under dataset shift. Calibration split protocol under LOTO. Platt scaling parameters are fit on a 20 percent held-out calibration fold drawn from the LOTO training partition (the remaining 80 percent of non-test…
  22. Pathological-tail targets. Five LOTO-eligible targets exhibit sub-chance AUROC across all 10 canonical seeds: four small-n boundary cases (Q96SW2, P15170, Q9Y2I7, P33981; n between 17 and 21, all near the class-balance eligibility boundary) and Q07889 (n=91, biology SOS1, a Ras-pathway GEF whose non-enzymatic scaffold-protein SAR diverges from the kinas…
  23. The merged dataset, fold assignments, and accompanying evaluation code are released under CC-BY-4.0 (dataset) and MIT (code). All compound activity measurements are derived from peer-reviewed publications and patent literature; no proprietary or restricted-access data are included. No human-subjects information is associate…

    is pub- lished under CC BY 4.0. The merged dataset, fold assignments, and accompanying evaluation code are released under CC-BY-4.0 (dataset) and MIT (code). All compound activity measurements are derived from peer-reviewed publications and patent literature; no proprietary or restricted-access data are included. No human-subjects information is associate...