pith. sign in

arxiv: 2605.29346 · v1 · pith:X4QM6J2Anew · submitted 2026-05-28 · 💻 cs.DC

Understanding and Reducing Metadata-Driven Host Overheads in Sampling-Based GNN Training

Pith reviewed 2026-06-29 05:57 UTC · model grok-4.3

classification 💻 cs.DC
keywords sampling-based GNNmetadata-driven executionhost overheadsCUDA GraphsGPU-resident executiondynamic workloadsmulti-GPU scaling
0
0 comments X

The pith

ZEROGNN removes the host from metadata-driven control in sampling-based GNN training by keeping runtime metadata on-device inside a fixed launch structure and provisioning a conservative execution envelope to restore CUDA Graph replayabilit

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that dynamic, metadata-driven execution in sampling-based GNN training places the CPU on the critical path, creating persistent host-device orchestration overhead and GPU-CPU synchronization that dominate runtime when GPU computation is small. ZEROGNN moves the entire control loop to the GPU, retains metadata on-device, and mediates variable execution inside a fixed launch structure. A conservative yet tight execution envelope is provisioned ahead of time so the structure becomes replayable under CUDA Graphs. If the approach holds, end-to-end speedups reach 5.28x, GPU execution fraction approaches 100 percent, memory use stays comparable to ideal allocation, and multi-GPU scaling improves because host bottlenecks disappear. A reader would care because the same metadata-driven pattern appears in many modern dynamic deep-learning workloads and removing the host from the loop directly improves hardware utilization.

Core claim

ZEROGNN removes the host from the metadata-driven control loop and enables fully GPU-resident execution under dynamic behavior. It keeps runtime metadata on-device, mediates dynamic execution within a fixed launch structure, and provisions a conservative yet tight execution envelope to restore CUDA Graph replayability.

What carries the argument

Fixed launch structure with on-device metadata and conservative execution envelope provisioning that restores CUDA Graph replayability for variable metadata-driven iterations.

If this is right

  • Up to 5.28x end-to-end speedup on sampling-based GNN workloads.
  • Near 100% GPU execution fraction.
  • Memory efficiency comparable to ideal metadata-informed allocation.
  • Strong multi-GPU scaling by eliminating host-side bottlenecks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same on-device metadata and fixed-structure technique could apply to other dynamic deep-learning patterns that currently force host mediation.
  • Provisioning conservative envelopes may become a reusable pattern for making variable GPU execution CUDA-Graph compatible in additional systems.
  • Removing host coordination could compound benefits in large-scale distributed training where host bottlenecks already limit scaling beyond two GPUs.

Load-bearing premise

A conservative yet tight execution envelope can be provisioned in advance that still permits fully GPU-resident dynamic behavior without unacceptable memory waste or correctness issues.

What would settle it

A sampling-based GNN workload whose runtime metadata exceeds the pre-provisioned envelope bounds, producing either out-of-memory errors or fallback to host-mediated execution that erases the reported speedup.

Figures

Figures reproduced from arXiv: 2605.29346 by Bin Ren, Guannan Wang, Pradeep Kumar, Saima Afrin, Yidong Gong, Yuchen Ma.

Figure 2
Figure 2. Figure 2: reports GPU execution fraction for GraphSAGE on Reddit across different batch sizes. We find that GPU execution fraction in both the overall pipeline and the training stage remains relatively low when the batch size is smaller than 4096. Specifically, when considering batch size 128, only 45% of the end-to-end runtime corresponds to active GPU computation, while the remaining 55% is GPU idle time. This low… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of end-to-end training runtime and corresponding GPU Utilization comparison between DGL and Gong et al across Different Batch Sizes 68% at batch size 256). This suggests that trimming frame￾work code improves throughput, but the GPU still cannot sustain near-continuous execution due to persistent HDOO and synchronization points. In summary, sampling-based GNN training frequently oper￾ates in a… view at source ↗
Figure 4
Figure 4. Figure 4: illustrates this behavior with a representative multi￾hop sampling GNN workflow. Within each hop ℓ, GPU ker￾nels preSampling(graph) and postSampling(subgraph) gen￾erate hop-specific runtime metadata such as the sampled ver￾tex/edge counts |𝑉ℓ | and |𝐸ℓ |. These metadata immediately trigger two nested dependency structures. (a) Intra-hop dependency. Within the same hop, later GPU execution depends on CPU-re… view at source ↗
Figure 5
Figure 5. Figure 5: illustrates the transformed execution flow. ZEROGNN uses: 1. Device-Resident Metadata Buffer (DRMB). to keep runtime metadata (e.g., sampled 𝑉 𝑁 𝑑 and 𝐸 𝑁 𝑑 , frontier sizes) in GPU memory and let downstream kernels consume it di￾rectly via device pointers, eliminating per-iteration GPU → CPU metadata round-trips. 2. Device-Side Launch Media￾tion (DLM) decouples kernel execution from per-iteration CPU scal… view at source ↗
Figure 6
Figure 6. Figure 6: shows a state-of-the-art SpMM kernel [11] maintains near-constant runtime even when the grid is over-allocated by a large margin (e.g., from +20% to +180%) on Reddit and OGBN-Products datasets, indicating that extra blocks can quickly return and incur negligible overhead. Based on this insight, it is clear that over- or under-allocation does not prevent the computation from being completed. Fur￾ther, over-… view at source ↗
Figure 8
Figure 8. Figure 8: Sampling Only Runtime Speedup Over DGL and Gong et al Across Different Datasets. Y￾axis Values Are Clipped [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Accuracy measure of ZeroGNN and DGL [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 11
Figure 11. Figure 11: Memory usage efficiency comparison between ZEROGNN and the MaxSG (Naive Maximum Subgraph Allocation strategy) across different sampling depths. The MaxSG serves as the baseline (value = 1). Efficiency is measured on a log2 scale, where higher values indicate better memory efficiency. since the communicated data mainly consists of model pa￾rameters, which are relatively small in sampling-based GNN models. … view at source ↗
Figure 12
Figure 12. Figure 12: End-to-End Runtime Speedup For ZEROGNN Over Gong et al Across Different Batch Sizes On OGBN-papers100M Dataset. 10 [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: ZEROGNN End-to-End Training Runtime Comparison Under 2 GPUs Configurations [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 19
Figure 19. Figure 19: End-to-End Training Runtime Comparison Across Differ￾ent Design Choices to handle Dynamic Dataflow. Normalized range and coefficient of variation. To com￾pare fluctuations across different sampling settings, we nor￾malize by the mean: 𝑟range = Range 𝜇 = 2𝑧 (𝑚) 𝑝 𝜎 𝜇 = 2𝑧 (𝑚) 𝑝 · CV, (24) where CV = 𝜎 𝜇 (25) is the coefficient of variation. Core insight: CV depends only on 𝑝𝑣 . Recall: |𝑉𝑠 | = ∑︁ 𝑣 𝐼𝑣, 𝐼𝑣 … view at source ↗
Figure 20
Figure 20. Figure 20: Distribution of sampled subgraph sizes (number of nodes) for ZEROGNN on the Reddit dataset. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_20.png] view at source ↗
read the original abstract

Modern deep learning workloads increasingly exhibit dynamic, metadata-driven execution, where runtime-generated information determines memory provisioning and kernel launch decisions. In sampling-based graph neural network (GNN) training, this behavior places the CPU on the critical path, introducing persistent host-device orchestration overhead and frequent GPU-CPU synchronization, which dominate end-to-end runtime when GPU computation is small. Existing approaches, including CUDA Graphs and GPU dynamic parallelism, fail to address this problem because the metadata-driven control loop remains host-mediated, and execution structure varies across iterations. We present ZEROGNN, a system that removes the host from the metadata-driven control loop and enables fully GPU-resident execution under dynamic behavior. ZEROGNN keeps runtime metadata on-device, mediates dynamic execution within a fixed launch structure, and provisions a conservative yet tight execution envelope to restore CUDA Graph replayability. Experiments on sampling-based GNN workloads show that ZEROGNN achieves up to 5.28 x end-to-end speedup, near 100% GPU execution fraction, and memory efficiency comparable to ideal metadata-informed allocation, while enabling strong multi-GPU scaling by eliminating host-side bottlenecks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ZEROGNN, a system to remove the host from the metadata-driven control loop in sampling-based GNN training. It keeps runtime metadata on-device, mediates dynamic execution inside a fixed launch structure, and provisions a conservative yet tight execution envelope to restore CUDA Graph replayability. Central claims are up to 5.28× end-to-end speedup, near-100% GPU execution fraction, memory efficiency comparable to ideal metadata-informed allocation, and improved multi-GPU scaling.

Significance. If the central claims hold, the work would address a practical performance bottleneck in dynamic, metadata-driven DL workloads by enabling fully GPU-resident execution and CUDA Graph compatibility. This could have targeted impact on GNN training systems and broader relevance to host-device orchestration overheads.

major comments (2)
  1. [Abstract] Abstract: the claim that a conservative yet tight execution envelope can be provisioned in advance to accommodate variable sampling metadata (neighbor sample sizes, etc.) while remaining both tight enough for memory efficiency and loose enough to avoid host fallback is load-bearing for the 5.28× speedup and near-100% GPU fraction results, yet the abstract provides no derivation, bound, or sensitivity analysis demonstrating that such an envelope exists for the reported workloads.
  2. [Abstract] Abstract: empirical claims of speedups, GPU execution fraction, and memory efficiency are stated without any description of methods, datasets, error analysis, or experimental setup, preventing evaluation of whether the data support the claims.
minor comments (1)
  1. The abstract would be clearer if it briefly identified the specific GNN models, sampling algorithms, and graph datasets used to obtain the reported numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The two major comments both concern the abstract's level of detail. We agree these points merit revision and will update the abstract in the next version to better support the central claims while preserving conciseness. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that a conservative yet tight execution envelope can be provisioned in advance to accommodate variable sampling metadata (neighbor sample sizes, etc.) while remaining both tight enough for memory efficiency and loose enough to avoid host fallback is load-bearing for the 5.28× speedup and near-100% GPU fraction results, yet the abstract provides no derivation, bound, or sensitivity analysis demonstrating that such an envelope exists for the reported workloads.

    Authors: The abstract summarizes the approach at a high level; the derivation of the envelope, its tightness bounds, and sensitivity analysis appear in Section 4.2 and Figure 7 of the full manuscript, confirming the envelope works for the evaluated workloads without host fallback. We will revise the abstract to add one sentence referencing the conservative provisioning strategy and its empirical validation on the reported datasets. revision: yes

  2. Referee: [Abstract] Abstract: empirical claims of speedups, GPU execution fraction, and memory efficiency are stated without any description of methods, datasets, error analysis, or experimental setup, preventing evaluation of whether the data support the claims.

    Authors: Abstracts are length-limited and conventionally omit full methodological detail, which is provided in Section 5 (Experiments), including datasets (Reddit, ogbn-products, etc.), hardware, and error analysis via repeated runs with standard deviation. To improve standalone readability we will insert a brief clause naming the primary datasets and evaluation methodology. revision: yes

Circularity Check

0 steps flagged

No circularity: systems design with external experimental validation

full rationale

The paper describes a systems technique (ZEROGNN) for eliminating host-mediated metadata loops in sampling-based GNN training via on-device metadata and a fixed launch structure with a pre-provisioned envelope. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or description. Central performance claims (speedup, GPU fraction, memory efficiency) are presented as outcomes of experiments on concrete workloads rather than reductions to prior self-citations or ansatzes. The approach is self-contained against external benchmarks (measured end-to-end runtimes), satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; ledger left empty pending full text.

pith-pipeline@v0.9.1-grok · 5744 in / 1124 out tokens · 17637 ms · 2026-06-29T05:57:39.042946+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    Residual Gated Graph ConvNets

    X. Bresson and T. Laurent. Residual gated graph convnets.arXiv preprint arXiv:1711.07553, 2017

  2. [2]

    Z. Cai, Q. Zhou, X. Yan, D. Zheng, X. Song, C. Zheng, J. Cheng, and G. Karypis. DSP: Efficient GNN Training with Multiple GPUs. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’23, page 392–404, 2023

  3. [3]

    Z. Chen, M. Yan, M. Zhu, L. Deng, G. Li, S. Li, and Y . Xie. fuseGNN: Accelerating Graph Convolutional Neural Network Training on GPGPU. InProceedings of the 39th International Conference on Computer-Aided Design, pages 1–9, 2020

  4. [4]

    Chiang, X

    W.-L. Chiang, X. Liu, S. Si, Y . Li, S. Bengio, and C.-J. Hsieh. Cluster- GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, page 257–266, New York, NY , USA, 2019. Association for Computing Machinery

  5. [5]

    G. Dai, G. Huang, S. Yang, Z. Yu, H. Zhang, Y . Ding, Y . Xie, H. Yang, and Y . Wang. Heuristic adaptability to input dynamics for SpMM on GPUs. InProceedings of the 59th ACM/IEEE Design Automation Conference, DAC ’22, page 595–600, New York, NY , USA, 2022. Association for Computing Machinery

  6. [6]

    W. Fan, Y . Ma, Q. Li, Y . He, E. Zhao, J. Tang, and D. Yin. Graph Neural Networks for Social Recommendation. InThe World Wide Web Conference, pages 417–426, 2019

  7. [7]

    Q. Fu, Y . Ji, and H. H. Huang. TLPGNN: A Lightweight Two-Level Par- allelism Paradigm for Graph Neural Network Computation on GPU. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pages 122–134, 2022

  8. [8]

    T. Gale, M. Zaharia, C. Young, and E. Elsen. Sparse GPU Kernels for Deep Learning. In2020 SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 219–232. IEEE Computer Society, 2020

  9. [9]

    Gandhi and A

    S. Gandhi and A. P. Iyer. P3: Distributed Deep Graph Learning at Scale. In15th USENIX Symposium on Operating Systems Design and Implementation (OSDI’21), pages 551–568, 2021

  10. [10]

    Gong and P

    Y . Gong and P. Kumar. GNNBENCH: Fair and Productive Benchmark- ing for Single-GPU GNN System.arXiv preprint arXiv:2404.04118, 2024

  11. [11]

    Gong and P

    Y . Gong and P. Kumar. GNNOne: A Unified System Optimizations for GNN Kernels. InProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’24, page 15–27, New York, NY , USA, 2024. Association for Computing Machinery

  12. [12]

    Y . Gong, A. K. Tarafder, S. Afrin, and P. Kumar. Identifying and Ana- lyzing Pitfalls in GNN Systems. InProceedings of the 2025 USENIX Annual Technical Conference, 2025

  13. [13]

    Hamilton, P

    W. Hamilton, P. Bajaj, M. Zitnik, D. Jurafsky, and J. Leskovec. Em- bedding Logical Queries on Knowledge Graphs.Advances in Neural Information Processing Systems, 31:2026–2037, 2018

  14. [14]

    Hamilton, Z

    W. Hamilton, Z. Ying, and J. Leskovec. Inductive Representation Learn- ing on Large Graphs. InAdvances in neural information processing systems, pages 1024–1034, 2017

  15. [15]

    Y . Hu, Z. Ye, M. Wang, J. Yu, D. Zheng, M. Li, Z. Zhang, Z. Zhang, and Y . Wang. FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems. InProceedings of the International Con- ference for High Performance Computing, Networking, Storage and Analysis, pages 1–13, 2020

  16. [16]

    Huang, G

    G. Huang, G. Dai, Y . Wang, and H. Yang. GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Net- works. InSC20: International Conference for High Performance Com- puting, Networking, Storage and Analysis, pages 1–12. IEEE, 2020

  17. [17]

    Huang, J

    K. Huang, J. Zhai, Z. Zheng, Y . Yi, and X. Shen. Understanding and Bridging the Gaps in Current GNN Performance Optimizations. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 119–132, 2021

  18. [18]

    T. B. Jablin, J. A. Jablin, P. Prabhu, F. Liu, and D. I. August. Dynami- cally managed data for CPU-GPU architectures. InProceedings of the Tenth International Symposium on Code Generation and Optimization, pages 165–174, 2012

  19. [19]

    T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and D. I. August. Automatic CPU-GPU communication management and optimization.SIGPLAN Not., 46(6):142–151, June 2011

  20. [20]

    Jangda, S

    A. Jangda, S. Polisetty, A. Guha, and M. Serafini. Accelerating Graph Sampling for Graph Machine Learning using GPUs. InProceedings of the Sixteenth European Conference on Computer Systems, 2021

  21. [21]

    Kaler, N

    T. Kaler, N. Stathas, A. Ouyang, A.-S. Iliopoulos, T. Schardl, C. E. Leiserson, and J. Chen. Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining.Proceedings of Machine Learning and Systems, 4:172–189, 2022

  22. [22]

    T. N. Kipf and M. Welling. Semi-Supervised Classification with Graph Convolutional Networks. In5th International Conference on Learning Representations (ICLR-17), 2017

  23. [23]

    Krahmer, S

    E. Krahmer, S. v. Erk, and A. Verleg. Graph-Based Generation of Referring Expressions.Computational Linguistics, 29(1):53–72, 2003

  24. [24]

    Li and B

    L. Li and B. Chapman. Compiler assisted hybrid implicit and explicit GPU memory management under unified address space. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, New York, NY , USA, 2019. Association for Computing Machinery

  25. [25]

    M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V . Josi- fovski, J. Long, E. J. Shekita, and B.-Y . Su. Scaling distributed ma- chine learning with the parameter server. InProceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, page 583–598, USA, 2014. USENIX Association

  26. [26]

    S. Li, Y . Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala. PyTorch distributed: experiences on accelerating data parallel training.Proc. VLDB Endow., 13(12):3005–3018, Aug. 2020

  27. [27]

    S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concurrency with elastic kernels.SIGARCH Comput. Archit. News, 41(1):407–418, Mar. 2013

  28. [28]

    Perera and P

    R. Perera and P. Nand. Recent Advances in Natural Language Genera- tion: A Survey and Classification of the Empirical Literature.Comput- ing and Informatics, 36(1):1–32, 2017

  29. [29]

    Schlichtkrull, T

    M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling Relational Data with Graph Convolutional Networks. InEuropean Semantic Web Conference, pages 593–607. Springer, 2018

  30. [30]

    Horovod: fast and easy distributed deep learning in TensorFlow

    A. Sergeev and M. D. Balso. Horovod: fast and easy distributed deep learning in TensorFlow.ArXiv, abs/1802.05799, 2018

  31. [31]

    A. K. Tarafder, Y . Gong, and P. Kumar. Optimization of GNN Training Through Half-precision. InProceedings of the 34th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’25, New York, NY , USA, 2025. Association for Computing Machinery

  32. [32]

    Veliˇckovi´c, G

    P. Veliˇckovi´c, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y . Ben- gio. Graph Attention Networks.6th International Conference on Learning Representations (ICLR-18), 2018

  33. [33]

    Waleffe, J

    R. Waleffe, J. Mohoney, T. Rekatsinas, and S. Venkataraman. Mar- iusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks. InEighteenth European Conference on Computer Systems (EuroSys’ 23), 2023

  34. [34]

    H. Wang, H. Ren, and J. Leskovec. Entity Context and Relational Paths for Knowledge Graph Completion.arXiv preprint arXiv:2002.06757, 2020. 13 Conference’17, July 2017, Washington, DC, USA Yidong Gong, Saima Afrin, Y uchen Ma, Guannan Wang, Bin Ren, and Pradeep Kumar

  35. [35]

    M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y . Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang. Deep Graph Library: Towards Efficient And Scalable Deep Learning on Graphs. InICLR 2019 Workshop on Representation Learning on Graphs and Manifolds, 2019

  36. [36]

    Y . Wang, B. Feng, G. Li, S. Li, L. Deng, Y . Xie, and Y . Ding. GNNAdvi- sor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs. In15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 515–531, 2021

  37. [37]

    Y . Wu, K. Ma, Z. Cai, T. Jin, B. Li, C. Zheng, J. Cheng, and F. Yu. Seastar: Vertex-centric Programming for Graph Neural Networks. In Proceedings of the Sixteenth European Conference on Computer Sys- tems, pages 359–375, 2021

  38. [38]

    K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How Powerful are Graph Neural Networks?7th International Conference on Learning Represen- tations (ICLR-19), 2019

  39. [39]

    C. Yang, A. Buluç, and J. D. Owens. Design principles for sparse matrix multiplication on the gpu. InEuropean Conference on Parallel Processing, pages 672–687. Springer, 2018

  40. [40]

    D. Yang, J. Liu, J. Qi, and J. Lai. WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture. 2022

  41. [41]

    J. Yang, D. Tang, X. Song, L. Wang, Q. Yin, R. Chen, W. Yu, and J. Zhou. GNNLab: A Factored System for Sample-Based GNN Training over GPUs. InProceedings of the Seventeenth European Conference on Computer Systems, pages 417–434, 2022

  42. [42]

    R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983, 2018

  43. [43]

    Zhang, X

    J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs. InProceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 339–349, 2018

  44. [44]

    Zitnik, M

    M. Zitnik, M. Agrawal, and J. Leskovec. Modeling Polypharmacy Side Effects with Graph Convolutional Networks.Bioinformatics, 34(13):i457–i466, 2018. 14 Metadata-Driven Host Overheads in GNN T raining Conference’17, July 2017, Washington, DC, USA A Proof of Lemma 4.1 Problem setting.We consider the standard multi-hop neigh- bor sampling procedure widely us...