pith. machine review for the scientific record.

arxiv: 2605.13826 · v1 · submitted 2026-05-13 · 💻 cs.LG · cond-mat.mtrl-sci · physics.chem-ph

Recognition: unknown

Reducing cross-sample prediction churn in scientific machine learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.mtrl-sci · physics.chem-ph
keywords prediction churn · bootstrapping · consistency loss · scientific machine learning · ensemble methods · chemistry benchmarks · data resampling

The pith

K-bootstrap bagging cuts cross-sample prediction churn by 40-54% at no accuracy cost on chemistry benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that two models trained on independent random draws from the same data set often assign different class labels to the same test examples, even when their overall accuracy matches within a few points. This disagreement, termed cross-sample prediction churn, stays high under standard parameter-side methods such as deep ensembles or dropout. A simple data-side resampling scheme called K-bootstrap bagging lowers the disagreement rate by 40 to 54 percent on all nine tested chemistry data sets while leaving accuracy unchanged. A joint-training variant called twin-bootstrap, which adds a symmetric KL consistency term between the two networks, lowers churn by a further median 45 percent beyond bagging at K=2, at the same 2-times compute budget.
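To make the measurement concrete, here is a minimal sketch of how cross-sample churn can be computed. The `train_classifier` fit procedure is a hypothetical stand-in; the paper's own training setup is not reproduced here.

```python
import numpy as np

def cross_sample_churn(X_train, y_train, X_test, train_classifier, seed=0):
    """Disagreement rate between two models fit on independent bootstraps."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(2):  # two independent bootstrap draws of the training pool
        idx = rng.integers(0, n, size=n)  # sample n indices with replacement
        model = train_classifier(X_train[idx], y_train[idx])
        preds.append(np.asarray(model.predict(X_test)))
    # churn = fraction of test examples whose predicted label differs
    return float(np.mean(preds[0] != preds[1]))
```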

Core claim

Two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within 1.3-4.2 percentage points but disagree on the class label of 8.0-21.8% of test molecules. K-bootstrap bagging cuts this churn rate by 40-54% on every dataset at no accuracy cost, for K-times ERM compute. Twin-bootstrap, two networks trained jointly on independent bootstraps with a symmetric KL consistency loss, reduces churn by a further median 45% beyond bagging-K=2 at matched 2-times compute.
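A sketch of K-bootstrap bagging under similar assumptions (scikit-learn-style estimators produced by a hypothetical `make_model` factory). Each retraining compared for churn is itself a majority vote over K bootstrap models, which is what costs K-times ERM compute.

```python
import numpy as np

def fit_bag(X, y, make_model, K, rng):
    """Fit K models, each on an independent bootstrap resample."""
    models = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap draw
        models.append(make_model().fit(X[idx], y[idx]))
    return models

def bag_predict(models, X_test):
    """Majority vote over the K bootstrap models (labels as small ints)."""
    votes = np.stack([np.asarray(m.predict(X_test), dtype=int) for m in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```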

What carries the argument

Twin-bootstrap training: two networks trained jointly on independent bootstraps with a symmetric KL consistency loss between their output distributions.
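A minimal PyTorch sketch of that objective, under stated assumptions: each twin sees only its own bootstrap draw, and the symmetric-KL term is evaluated on a shared batch of inputs. The paper does not specify the batching here, so `x_shared` and the weight name `lam` are illustrative.

```python
import torch.nn.functional as F

def sym_kl(logits_a, logits_b):
    """Symmetric KL between the two networks' predictive distributions."""
    log_pa = F.log_softmax(logits_a, dim=-1)
    log_pb = F.log_softmax(logits_b, dim=-1)
    pa, pb = log_pa.exp(), log_pb.exp()
    kl_ab = (pa * (log_pa - log_pb)).sum(dim=-1)  # KL(p_a || p_b)
    kl_ba = (pb * (log_pb - log_pa)).sum(dim=-1)  # KL(p_b || p_a)
    return (kl_ab + kl_ba).mean()

def twin_bootstrap_loss(net_a, net_b, batch_a, batch_b, x_shared, lam):
    """Joint objective: per-twin cross-entropy plus weighted consistency."""
    xa, ya = batch_a  # bootstrap draw seen only by net_a
    xb, yb = batch_b  # bootstrap draw seen only by net_b
    ce = F.cross_entropy(net_a(xa), ya) + F.cross_entropy(net_b(xb), yb)
    return ce + lam * sym_kl(net_a(x_shared), net_b(x_shared))
```

The weight `lam` corresponds to the λ that the ledger below lists as the method's single free hyperparameter; per Figure 4, the pre-registered rule picks the largest λ that keeps id-accuracy within 0.02 of ERM.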

If this is right

  • Cross-sample prediction churn should be reported as a separate column in scientific-ML benchmark tables.
  • Parameter-side uncertainty techniques leave churn unchanged while data-side bagging reduces it.
  • Twin-bootstrap delivers extra churn reduction at fixed 2-times compute compared with ordinary bagging.
  • Churn reduction comes at linear extra compute for bagging but requires no accuracy trade-off.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Churn may serve as a practical diagnostic for whether a model has extracted stable decision boundaries from limited scientific data.
  • The same data-resampling logic could be tested on regression or multi-task scientific problems to check if churn behaves similarly.
  • If churn remains after these interventions, it may point to inherent limits in the information contained in the training distribution itself.

Load-bearing premise

The churn reductions measured on the nine chemistry classification benchmarks will continue to hold for other tasks and data distributions.

What would settle it

Applying K-bootstrap bagging or twin-bootstrap to a non-chemistry classification task and measuring no reduction in label disagreement across independent retrainings.

Figures

Figures reproduced from arXiv: 2605.13826 by Gordan Prastalo, Kevin Maik Jablonka.

Figure 1. Twin-bootstrap eliminates most of ERM's retraining-induced prediction flips on BACE. Each row is one of the 80 test molecules with the largest cross-sample contrast; each column is one of ten retrainings on an independent bootstrap of the BACE training pool; cells are coloured by predicted class. Visible vertical stripes in the left (ERM) panel are predictions that flip class across retrainings; under twin…

Figure 2. Bagging and twin-bootstrap beat ERM on every chemistry benchmark; MC dropout, deep ensembles, and SWA do not. Left: paired ∆ id-churn vs. ERM for six methods, one row per dataset (smallest N at top), 95% paired-bootstrap CIs across the 45 seed pairs. Vertical reference lines mark each method's across-dataset mean (twin-bootstrap solid, others dashed; colours match markers); the solid black line is parity w…

Figure 3. Routing the top 30% of test predictions (ranked by per-example churn from one extra retraining) to a human reviewer captures 58–100% of all retraining-induced class flips on the 9 chemistry datasets. Each curve is one dataset (BACE in red, held-out in grey); x-axis is the fraction of test predictions reviewed in churn-rank order, y-axis is the cumulative fraction of total flip-mass captured. The diagonal m…

Figure 4. Twin-bootstrap λ=300 sits at the accuracy-preserving end of the BACE Pareto frontier. Twin-bootstrap at λ ∈ {1, 3, 10, 30, 100, 300} traces an accuracy-vs.-churn trajectory; the pre-registered selection rule (largest λ with id-acc ≥ ERM-id-acc −0.02) picks λ=300 (filled red). Bagging-K=5 (blue circle) achieves similar accuracy at higher churn; ERM (open square, upper-right) sits at the highest churn. Error ba…

Figure 5. Six BACE id-test molecules where ERM flips class on ≥ 36% of seed pairs and twin-bootstrap flips on 0% over the same ten retrainings. Both methods see the same canonical training pool and test set; the difference is the consistency loss. (Per-molecule flip rates: ERM 56%, 53%, 53%, 47%, 47%, 36%; twin-bootstrap 0% in each case.)

Figure 6. One extra bootstrap is enough: the top-30% recall at K=2 is within 10–24 pp of the K=10 gold standard on every dataset. Mean recall across 30 random K-subsets per K<10. Per-dataset at K=2: 48–83%. At K=10 (using all ten bootstraps to score churn): 58–100%. MOF-thermal is the floor at every K (48–58%); BBBP and BBB-Martins reach 100% recall by K=5.
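Figures 3 and 6 describe a churn-ranked triage protocol. A sketch under assumptions: per-example churn is scored as disagreement with a reference retraining, which may differ in detail from the paper's exact scoring.

```python
import numpy as np

def triage_recall(pred_matrix, review_frac=0.30):
    """pred_matrix: (K, n_test) predicted labels from K independent retrainings."""
    ref = pred_matrix[0]                          # reference retraining
    flips = (pred_matrix != ref).mean(axis=0)     # per-example churn score
    order = np.argsort(-flips)                    # most churn-prone first
    budget = int(review_frac * pred_matrix.shape[1])
    total = flips.sum()
    # recall = share of total flip mass captured within the review budget
    return float(flips[order[:budget]].sum() / total) if total > 0 else 1.0
```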
Original abstract

Scientific machine learning reports predictive performance. It does not report whether the same prediction would survive a different draw of training data. Across $9$ chemistry benchmarks, two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within $1.3\text{--}4.2$ percentage points but disagree on the class label of $8.0\text{--}21.8\%$ of test molecules. We call this gap \emph{cross-sample prediction churn}. The standard parameter-side techniques (deep ensembles, MC dropout, stochastic weight averaging) do not reduce this gap; two data-side methods do. The first is $K$-bootstrap bagging, which cuts the rate $40\text{--}54\%$ on every dataset at no accuracy cost ($K{\times}$-ERM compute). The second is \emph{twin-bootstrap}, our proposal: two networks trained jointly on independent bootstraps with a sym-KL consistency loss between their predictions, which at matched $2{\times}$-ERM compute reduces churn a further median $45\%$ beyond bagging-$K{=}2$. Cross-sample prediction churn deserves a column alongside predictive performance in scientific-ML benchmark reports, because without it the parameter-side and data-side methods are indistinguishable on the metric they actually differ on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces cross-sample prediction churn as the disagreement rate on test predictions between classifiers trained on independent bootstraps of the same training set. On nine chemistry benchmarks it shows that parameter-side methods (ensembles, MC dropout, SWA) leave churn largely unchanged while K-bootstrap bagging reduces churn 40–54 % at no accuracy cost and twin-bootstrap (joint training on two bootstraps plus symmetric-KL consistency loss) yields a further median 45 % reduction at matched 2×-ERM compute; the authors argue that churn should be reported alongside accuracy.

Significance. If the quantitative claims hold, the work is significant because it isolates a distinct failure mode—sensitivity to training-set resampling—that is invisible to standard accuracy or uncertainty metrics yet directly relevant to scientific reproducibility. The consistent empirical pattern across nine benchmarks and the proposal of two simple, compute-matched remedies constitute a concrete, falsifiable contribution that could be adopted as a standard reporting practice.

major comments (1)
  1. [§4] §4 (results tables): the central quantitative claims—40–54 % churn reduction for K-bootstrap and median 45 % further reduction for twin-bootstrap—are reported as point estimates or ranges across the nine datasets with no standard deviations, confidence intervals, or repeated independent trials (different bootstrap seeds and optimization seeds). Because both resampling and SGD are stochastic, it is impossible to determine whether the observed differences exceed run-to-run variability; the same tables also present accuracy gaps (1.3–4.2 pp) without variance, undermining the “no accuracy cost” claim.
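For concreteness, a sketch of the kind of interval the referee asks for: a 95% paired-bootstrap confidence interval on per-seed-pair churn differences between a method and ERM. The inputs are hypothetical per-seed-pair churn values; the paper's own CI procedure is not reproduced here.

```python
import numpy as np

def paired_bootstrap_ci(churn_method, churn_erm, n_boot=10_000, seed=0):
    """95% CI on the mean paired churn difference (method minus ERM)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(churn_method) - np.asarray(churn_erm)  # one per seed pair
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)  # resample seed pairs with replacement
    return np.percentile(boot_means, [2.5, 97.5])
```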
minor comments (2)
  1. [Abstract] Abstract: the accuracy comparison is stated as “1.3–4.2 percentage points” without clarifying whether this is a range, mean, or median; adding a brief parenthetical on variance would improve clarity.
  2. [§3.2] §3.2: the hyper-parameter governing the sym-KL loss weight is introduced but no ablation or sensitivity plot is provided; a short supplementary table showing churn and accuracy versus this weight on one or two datasets would strengthen the method description.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the presentation of our quantitative results. We address the major comment below.

Point-by-point responses
  1. Referee: [§4] §4 (results tables): the central quantitative claims—40–54 % churn reduction for K-bootstrap and median 45 % further reduction for twin-bootstrap—are reported as point estimates or ranges across the nine datasets with no standard deviations, confidence intervals, or repeated independent trials (different bootstrap seeds and optimization seeds). Because both resampling and SGD are stochastic, it is impossible to determine whether the observed differences exceed run-to-run variability; the same tables also present accuracy gaps (1.3–4.2 pp) without variance, undermining the “no accuracy cost” claim.

    Authors: We agree that reporting variability is necessary to substantiate the claims given the stochasticity of bootstrapping and SGD. In the revised manuscript we will rerun all experiments across multiple independent trials (distinct bootstrap seeds and optimization seeds), and update the §4 tables to report means ± standard deviations for the churn-reduction percentages and for the accuracy values. This will allow direct assessment of whether the 40–54 % and median-45 % reductions exceed run-to-run variability and whether the accuracy gaps (1.3–4.2 pp) remain negligible within the observed variance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on external benchmarks

full rationale

The paper defines cross-sample prediction churn as an observable disagreement rate between models trained on independent bootstraps and then reports measured reductions (40-54% for K-bootstrap, median 45% further for twin-bootstrap) from direct experiments on nine held-out chemistry classification benchmarks. These are point estimates from test-set comparisons, not quantities derived from the same data via fitting or self-referential equations. No load-bearing step reduces by construction to its inputs, no uniqueness theorem is invoked, and no ansatz is smuggled via self-citation. The central claims remain falsifiable against the reported benchmarks and do not collapse into re-labeling of fitted parameters.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard supervised classification assumptions (i.i.d. draws, bootstrap resampling validity) and introduces one new training objective (sym-KL consistency) whose weight is a free hyperparameter; no new physical entities or ad-hoc axioms are postulated.

free parameters (1)
  • sym-KL loss weight
    The coefficient balancing the consistency term against the classification loss is chosen by the authors and affects the reported churn reduction.
axioms (1)
  • domain assumption Bootstrap samples are valid proxies for independent draws from the same underlying distribution
    Invoked when treating independent bootstraps as the source of cross-sample variation.

pith-pipeline@v0.9.0 · 5540 in / 1499 out tokens · 24986 ms · 2026-05-14T19:11:39.413585+00:00 · methodology

