arxiv: 2605.22593 · v1 · pith:TVQUCVMZnew · submitted 2026-05-21 · 💻 cs.LG

Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?

Pith reviewed 2026-05-22 07:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords graph neural networks · deep ensembles · uncertainty quantification · epistemic uncertainty · epistemic collapse · message passing · functional convexity · aleatoric epistemic decomposition

The pith

Deep ensembles provide only marginal gains over single graph neural networks because independently trained models converge to overly similar predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether deep ensembles reliably quantify uncertainty when applied to graph neural networks that use message passing. It shows that ensembles add little value beyond a single model, and the small benefits mainly come from smoothing out random training variations in the point predictions themselves. An aleatoric-epistemic split of the uncertainty reveals that the networks agree too much on their outputs even though they start from different random initializations. This agreement removes the source of disagreement that ensembles normally use to measure epistemic uncertainty. Readers should care because graph data appears in many safety-critical settings where knowing when a prediction is uncertain matters for downstream decisions.

Core claim

Standard deep ensembles do not transfer their uncertainty-quantification success from other domains to message-passing graph neural networks. Across seven datasets the ensembles deliver only modest improvements over a lone model, and those gains arise chiefly from averaging optimization noise rather than from genuinely richer uncertainty estimates. The root cause is epistemic collapse: independently trained networks consistently produce nearly identical predictions because distinct parameter vectors map to almost the same function, a consequence of functional rather than weight-space convexity.

What carries the argument

Epistemic collapse: independently trained networks converge to nearly identical predictions on graph data despite having different parameters, which eliminates the disagreement that ensembles need to estimate epistemic uncertainty.
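
To make the role of disagreement concrete, here is a minimal sketch of one common aleatoric-epistemic split for classification ensembles. It assumes the entropy-based decomposition (total = entropy of the averaged prediction, aleatoric = average member entropy, epistemic = their difference); this page does not state which decomposition the paper uses, so treat that choice, the function names, and the toy probabilities as illustrative assumptions.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis`; probabilities are clipped for stability."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for an ensemble of classifiers.

    member_probs has shape (M, N, C): M members, N nodes (or graphs), C classes.
    Returns per-node (total, aleatoric, epistemic) uncertainty in nats.
    """
    mean_probs = member_probs.mean(axis=0)           # ensemble prediction, shape (N, C)
    total = entropy(mean_probs)                      # entropy of the averaged prediction
    aleatoric = entropy(member_probs).mean(axis=0)   # average entropy of each member
    epistemic = total - aleatoric                    # disagreement term (mutual information)
    return total, aleatoric, epistemic

# Hypothetical toy inputs: five members that agree exactly on one node ...
agree = np.tile(np.array([[0.7, 0.2, 0.1]]), (5, 1, 1))
# ... versus three members that are each confident in a different class.
disagree = np.array([
    [[0.90, 0.05, 0.05]],
    [[0.05, 0.90, 0.05]],
    [[0.05, 0.05, 0.90]],
])

total_agree, alea_agree, epi_agree = decompose_uncertainty(agree)
total_disagree, alea_disagree, epi_disagree = decompose_uncertainty(disagree)
print("epistemic when members agree:   ", epi_agree)
print("epistemic when members disagree:", epi_disagree)
```

When all members output the same distribution, the epistemic term vanishes regardless of how uncertain each member is, which is the situation the collapse describes.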

If this is right

  • Ensembles cannot be treated as a default reliable method for epistemic uncertainty in graph neural networks.
  • Any observed performance lift from ensembles is explained by reduced training noise rather than better uncertainty.
  • New uncertainty methods tailored to the functional geometry of graph models are required.
  • The transfer of ensemble techniques from vision or tabular data to graphs must be re-examined rather than assumed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collapse may appear in other structured prediction settings where the input graph imposes strong functional constraints.
  • Single-model uncertainty techniques or explicit diversity-promoting regularizers could be tested as direct remedies.
  • If functional convexity is the driver, then architectural changes that increase the expressivity of the message-passing layers might restore ensemble diversity.

Load-bearing premise

The lack of prediction diversity is caused by functional convexity in the solution space and occurs generally for message-passing graph neural networks rather than only in the specific architectures and seven datasets examined.

What would settle it

Observing substantially higher prediction disagreement and correspondingly stronger uncertainty calibration from ensembles on a new collection of graph datasets or architectures would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.22593 by Pedro C. Vieira, Pedro Ribeiro, Viacheslav Borovitskiy.

Figure 1: PEMS road network benchmark. Nodes with a black dot in the middle are train nodes.
Figure 2: Likelihood ratios exp(NLL_GNN − NLL_DE) and 95% Gaussian confidence intervals. This quantifies the relative predictive likelihood improvement of DE over GNN for both classification and regression datasets. Baselines indicate typical improvements observed in foundational DE literature.
Figure 3: Deconstructing the source of NLL improvements. …
Figure 4: Comparison of epistemic, aleatoric, and total uncertainty across datasets, demonstrating …
Figure 5: Weight-space convexity analysis. (a) compares the point prediction performance and (b) compares the NLL of a single model against a model soup formed by averaging the weights of the ensemble members. The degraded performance of the model soup confirms that the models reside in different regions of a highly non-convex weight space. Markers represent the mean, thin bars represent ±σ and transparent bars repr…
Figure 6: Comparison of point estimation performance, …
Figure 7: Weight-space convexity analysis. Figure 7a compares the point prediction performance and Figure 7b the NLL of a single model against a “model soup” formed by averaging the weights of the ensemble members. Lower values are better for NLL and higher are better for point prediction. The degraded performance of the model soup confirms that the models reside in different regions of a highly non-convex parameter…
Figure 8: Evolution of NLL (a, c and e) and point estimation metric (b, d and f) as the number of base models in the ensemble increases. Red dashed line denotes a trivial baseline.
Figure 9: Evolution of NLL (a, c, e and g) and point estimation metric (b, d, f and h) as the number of base models in the ensemble increases. Red dashed line denotes a trivial baseline.
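
The likelihood ratio named in the Figure 2 caption can be sketched as follows. This is a rough illustration rather than the authors' procedure: it assumes each NLL value is a mean per-example negative log-likelihood from one training seed, that the 95% Gaussian interval is built on the NLL difference and then exponentiated, and the numbers are made up purely to exercise the function.

```python
import numpy as np

def likelihood_ratio_ci(nll_single, nll_ensemble, z=1.96):
    """Return (ratio, lower, upper) for exp(NLL_single - NLL_ensemble).

    Inputs are per-seed mean NLL values (assumption). The Gaussian interval
    is computed on the NLL difference and mapped through exp afterwards.
    """
    diff = np.asarray(nll_single, dtype=float) - np.asarray(nll_ensemble, dtype=float)
    mean = diff.mean()
    sem = diff.std(ddof=1) / np.sqrt(diff.size)   # standard error of the mean difference
    return np.exp(mean), np.exp(mean - z * sem), np.exp(mean + z * sem)

# Hypothetical per-seed NLLs, not results from the paper.
nll_gnn = [0.52, 0.50, 0.55, 0.51, 0.53]
nll_de = [0.50, 0.49, 0.52, 0.50, 0.51]

ratio, lower, upper = likelihood_ratio_ci(nll_gnn, nll_de)
print(f"likelihood ratio {ratio:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
```

A ratio close to 1 means the ensemble assigns almost the same likelihood to held-out data as a single model; values well above 1 would indicate a substantial gain.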
Original abstract

While deep ensembles are widely considered to be the default method for uncertainty quantification in deep learning, their effectiveness for graph-structured data is often simply assumed based on successes in domains like computer vision. We investigate standard deep ensembles specifically for message-passing graph neural networks. Benchmarking across seven datasets representing varied tasks and complexities, we reveal that ensembles provide surprisingly little improvement over a single model. Instead, the observed marginal gains stem primarily from stabilizing optimization noise in point predictions rather than yielding meaningfully better uncertainty estimates. Through an aleatoric-epistemic decomposition, we identify epistemic collapse: independently trained networks consistently converge to overly similar predictions. Because disagreement is the fundamental mechanism through which ensembles capture epistemic uncertainty, this lack of diversity neutralizes their key advantage. Analyzing this phenomenon further, we suggest this collapse is driven by functional rather than weight-space convexity, where distinct parameter solutions induce almost identical behavior. Our results suggest that deep ensemble success does not seamlessly transfer to graph machine learning.
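
The distinction between weight space and function space that the abstract draws can be illustrated with a toy example. The sketch below is not the paper's experiment: it uses a small NumPy MLP instead of a GNN, and it builds the second model by permuting hidden units, which is only one well-known way for distinct parameters to induce identical behavior. All names (`mlp`, `soup`, the layer sizes) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, x):
    """Two-layer tanh network; params = [W1, b1, W2, b2]."""
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

def soup(param_sets):
    """Model soup: average each parameter tensor across models."""
    return [np.mean(np.stack(tensors), axis=0) for tensors in zip(*param_sets)]

# Model A: random weights (4 inputs, 8 hidden units, 1 output).
W1 = rng.normal(size=(4, 8))
b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 1))
b2 = rng.normal(size=1)
model_a = [W1, b1, W2, b2]

# Model B: same function, different weights (hidden units cyclically shifted).
perm = np.roll(np.arange(8), 1)
model_b = [W1[:, perm], b1[perm], W2[perm, :], b2]

x = rng.normal(size=(100, 4))
pred_a, pred_b = mlp(model_a, x), mlp(model_b, x)
pred_ensemble = 0.5 * (pred_a + pred_b)            # average in function space
pred_soup = mlp(soup([model_a, model_b]), x)       # average in weight space

gap_members = np.max(np.abs(pred_a - pred_b))      # members agree up to rounding
gap_ensemble = np.max(np.abs(pred_ensemble - pred_a))
gap_soup = np.max(np.abs(pred_soup - pred_a))      # soup is a different function
print(gap_members, gap_ensemble, gap_soup)
```

Here the two members sit at different points in weight space yet produce the same outputs, so averaging predictions changes nothing while averaging weights does. The model-soup comparison in Figures 5 and 7 probes the same contrast on trained networks.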

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks deep ensembles as an uncertainty quantification method for message-passing graph neural networks. Across seven datasets, it reports that ensembles yield only marginal gains over single models, which arise mainly from stabilizing optimization noise in point predictions rather than from meaningfully improved uncertainty estimates. Using an aleatoric-epistemic decomposition, the authors identify epistemic collapse: independently trained networks converge to overly similar predictions. They attribute this to functional (rather than weight-space) convexity and conclude that deep-ensemble success does not transfer to graph machine learning.

Significance. If the central empirical findings hold, the work usefully challenges the default transfer of deep-ensemble uncertainty methods to GNNs and motivates GNN-specific alternatives. The aleatoric-epistemic decomposition and the explicit link between prediction similarity and the failure of disagreement-based epistemic uncertainty constitute a clear, falsifiable analysis. The paper's strength lies in its reproducible benchmarking protocol and the introduction of the epistemic-collapse observation as a concrete phenomenon to be explained or mitigated.

major comments (2)
  1. [§4] §4 (Experimental results): The claim that epistemic collapse is characteristic of message-passing GNNs in general rests on the seven chosen datasets and standard GCN/GIN architectures. Without ablations on alternative layers (GAT, GraphSAGE), heterophilic graphs, or larger-scale datasets, it remains possible that the observed prediction similarity is an artifact of the specific inductive biases, optimization settings, or dataset homophily rather than a general property; this directly affects the load-bearing conclusion that ensembles cannot capture epistemic uncertainty in GNNs.
  2. [§3.3] §3.3 (Aleatoric-epistemic decomposition): The decomposition treats disagreement across ensemble members as the primary source of epistemic uncertainty. The manuscript should explicitly verify that the chosen diversity metric remains valid under graph-structured dependencies (e.g., message-passing correlations) and report sensitivity to the number of ensemble members and training seeds; otherwise the quantitative attribution of marginal gains to optimization stabilization rather than uncertainty improvement is not fully isolated.
minor comments (2)
  1. [Abstract] Abstract and §1: The seven datasets are described only as 'representing varied tasks and complexities'; listing their names, sizes, and task types (node classification, graph classification, etc.) would allow readers to assess coverage immediately.
  2. [§4] Figure captions and §4: Ensure all figures reporting ensemble vs. single-model metrics include error bars over multiple random seeds and clearly label the uncertainty decomposition components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments help clarify the scope and robustness of our findings on epistemic collapse in deep ensembles for message-passing GNNs. We address each major comment point-by-point below, indicating revisions made to the manuscript where appropriate.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental results): The claim that epistemic collapse is characteristic of message-passing GNNs in general rests on the seven chosen datasets and standard GCN/GIN architectures. Without ablations on alternative layers (GAT, GraphSAGE), heterophilic graphs, or larger-scale datasets, it remains possible that the observed prediction similarity is an artifact of the specific inductive biases, optimization settings, or dataset homophily rather than a general property; this directly affects the load-bearing conclusion that ensembles cannot capture epistemic uncertainty in GNNs.

    Authors: We appreciate the referee's emphasis on generalizability. Our experiments deliberately focused on canonical message-passing architectures (GCN and GIN) across seven datasets chosen to span different scales, tasks, and homophily levels, as these represent the most common inductive biases in the literature. We attribute epistemic collapse to functional convexity arising from the shared message-passing update rule rather than specific layer details. To address the concern, we have added new ablations in the revised Section 4 using GAT and GraphSAGE on both a homophilic dataset and a heterophilic one (Chameleon). These confirm similar levels of prediction similarity across ensemble members. For larger-scale datasets, we acknowledge practical compute limits prevented full replication but have expanded the discussion of scalability and potential limitations in the revised text. We believe these additions support the conclusion for standard message-passing GNNs while noting that future work could explore even broader settings. revision: yes

  2. Referee: [§3.3] §3.3 (Aleatoric-epistemic decomposition): The decomposition treats disagreement across ensemble members as the primary source of epistemic uncertainty. The manuscript should explicitly verify that the chosen diversity metric remains valid under graph-structured dependencies (e.g., message-passing correlations) and report sensitivity to the number of ensemble members and training seeds; otherwise the quantitative attribution of marginal gains to optimization stabilization rather than uncertainty improvement is not fully isolated.

    Authors: Thank you for this methodological suggestion. The diversity metric (prediction variance across members) is computed on the final node or graph outputs after message passing, so graph-induced correlations are already reflected in the forward passes of each network. To explicitly verify robustness, we have added sensitivity analyses in the revised Section 3.3 and a new appendix subsection. These vary ensemble size (3 to 10 members) and training seeds, showing that the observed low disagreement and attribution of gains to optimization stabilization remain consistent. We also include a brief check correlating the metric with an alternative epistemic uncertainty proxy on a controlled synthetic graph task. These revisions better isolate the effects and strengthen the aleatoric-epistemic decomposition. revision: yes
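
The kind of sensitivity check described in the second response can be sketched for regression as below. This is a hedged illustration on synthetic data, not the authors' analysis: it assumes each member outputs a Gaussian mean and variance per node, uses the law of total variance (aleatoric = mean of member variances, epistemic = variance of member means), and simulates members that nearly agree by adding small noise to a shared prediction.

```python
import numpy as np

def regression_decomposition(means, variances):
    """means, variances: arrays of shape (M, N) for M members and N nodes.

    Returns per-node (aleatoric, epistemic, total) predictive variance.
    """
    aleatoric = variances.mean(axis=0)   # average noise level predicted by members
    epistemic = means.var(axis=0)        # spread of member means (disagreement)
    return aleatoric, epistemic, aleatoric + epistemic

rng = np.random.default_rng(0)
n_nodes, max_members = 200, 10

# Synthetic setup (assumption): members share one prediction plus tiny perturbations.
shared = rng.normal(size=n_nodes)
means = shared + 0.01 * rng.normal(size=(max_members, n_nodes))
variances = np.full((max_members, n_nodes), 0.25)

# Vary ensemble size from 3 to 10 and track the epistemic share of total variance.
epistemic_share = {}
for m in range(3, max_members + 1):
    alea, epi, total = regression_decomposition(means[:m], variances[:m])
    epistemic_share[m] = float((epi / total).mean())

for m, share in epistemic_share.items():
    print(f"M={m:2d}  mean epistemic share of total variance: {share:.5f}")
```

If members barely disagree, the epistemic share stays small at every ensemble size, so adding members cannot recover the missing diversity.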

Circularity Check

0 steps flagged

No circularity: empirical observations on prediction similarity

full rationale

The paper is an empirical benchmarking study across seven datasets that directly measures prediction agreement among independently trained message-passing GNNs and reports marginal ensemble gains. No derivation chain, fitted-parameter prediction, or self-referential equation is present; the central observation of epistemic collapse follows from explicit experimental comparisons rather than reducing to inputs by construction. Self-citations, if any, are not load-bearing for the reported results, which remain falsifiable via the stated benchmarks and decomposition procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard supervised training assumptions for GNNs and the validity of the aleatoric-epistemic decomposition; no major free parameters or invented physical entities are introduced beyond the descriptive label 'epistemic collapse'.

axioms (1)
  • domain assumption: Disagreement among ensemble members is the primary mechanism for capturing epistemic uncertainty
    Invoked when interpreting lack of diversity as neutralizing the ensemble advantage
invented entities (1)
  • epistemic collapse (no independent evidence)
    purpose: Descriptive term for the observed convergence of independently trained GNNs to similar predictions
    Introduced to explain why ensembles fail to produce diversity; no independent falsifiable handle provided in the abstract


