pith · machine review for the scientific record

arxiv: 2605.09402 · v1 · submitted 2026-05-10 · 💻 cs.DC

Recognition: 2 theorem links

· Lean Theorem

ATLAS: Efficient Out-of-Core Inference for Billion-Scale Graph Neural Networks

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:05 UTC · model grok-4.3

classification 💻 cs.DC
keywords graph neural networks · out-of-core inference · billion-scale graphs · broadcast-based execution · single-machine processing · disk streaming · GNN optimization · full-graph inference

The pith

ATLAS enables efficient full-graph inference for billion-scale GNNs on single machines by using broadcast-based streaming from disk.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ATLAS as a solution for running complete GNN inference on graphs with billions of edges where the data exceeds available memory. It addresses the problem of high I/O costs in out-of-core settings by replacing gather-based access with a broadcast model that allows sequential single-pass reads of features and embeddings. This is supported by graph reordering, a minimum-pending-message eviction strategy in a tiered memory-disk setup, and GPU acceleration. The result is substantial speedups over previous out-of-core approaches while performing nearly as well as in-memory methods when data fits. Such capability matters for practical applications in areas like recommendation systems that need to process massive graphs without relying on clusters.

Core claim

ATLAS shows that full-graph, layer-wise GNN inference on out-of-core, billion-scale graphs can be made efficient on a single workstation: a broadcast-based execution model enables sequential streaming reads, and, combined with graph reordering and minimum-pending-message eviction in a tiered memory-disk hierarchy, yields 12–30× improvements in end-to-end time over state-of-the-art out-of-core baselines.

What carries the argument

The broadcast-based model that replaces gather operations to support sequential single-pass streaming of features and embeddings per layer.
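The contrast carrying this claim can be sketched in a few lines. This is an illustrative reimplementation under assumed names (`layer_gather`, `layer_broadcast`, sum aggregation, in-memory arrays standing in for disk-resident blocks), not the paper's code:

```python
import numpy as np

def layer_gather(features, in_neighbors):
    """Gather-based: each destination randomly reads its neighbors' rows.
    Random access is what causes read amplification when features live on disk."""
    out = np.zeros_like(features)
    for dst, srcs in enumerate(in_neighbors):
        for src in srcs:                 # random reads, one per edge
            out[dst] += features[src]
    return out

def layer_broadcast(features, out_neighbors):
    """Broadcast-based: stream each source row once, in order, and push its
    contribution to every destination. One sequential pass over features."""
    out = np.zeros_like(features)
    for src in range(len(features)):     # sequential, single-pass read
        row = features[src]              # the only read of this row this layer
        for dst in out_neighbors[src]:
            out[dst] += row              # accumulate pending message at dst
    return out
```

Both variants compute the same sum aggregation; the difference is that on disk-backed storage the gather version's inner random reads become the read amplification the paper targets, while the broadcast version touches each feature row exactly once in file order.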

If this is right

  • Single machines with 128 GiB RAM and 2 TiB SSD can handle graphs up to 4 billion edges and 550 GiB of features.
  • Performance stays within 5% of in-memory inference when features fit in RAM.
  • Multiple GNN architectures can be supported without introducing correctness errors.
  • End-to-end inference time improves by 12 to 30 times compared to prior out-of-core systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar streaming techniques might reduce communication costs in distributed GNN setups.
  • Graph reordering could be tuned further for specific layer types to boost cache hits.
  • Testing on graphs with extreme density variations would check if the eviction policy holds universally.
  • The approach points to hybrid pipelines for other large-scale graph computations beyond GNNs.
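The reordering the paper leans on traces back to reverse Cuthill-McKee [2]; a minimal sketch of that family of locality heuristics (an illustrative reimplementation, not ATLAS's reordering pass) looks like:

```python
from collections import deque

def cuthill_mckee(adj):
    """BFS-based Cuthill-McKee ordering: start from a minimum-degree seed and
    expand neighbors in increasing-degree order. Reversing the visit order
    gives the classic RCM permutation used to tighten access locality."""
    n = len(adj)
    order, seen = [], [False] * n
    for seed in sorted(range(n), key=lambda v: len(adj[v])):
        if seen[seed]:
            continue
        seen[seed] = True
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            order.append(v)
            for u in sorted(adj[v], key=lambda x: len(adj[x])):
                if not seen[u]:
                    seen[u] = True
                    queue.append(u)
    return order[::-1]   # reverse Cuthill-McKee
```

Relabeling vertices by this permutation clusters neighbors into nearby IDs, which is what lets a streaming reader keep pending messages short-lived.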

Load-bearing premise

The assumption that broadcast execution, combined with reordering and minimum-pending-message eviction, yields sequential single-pass reads, and hence the claimed speed, for any graph structure and any GNN architecture, without correctness errors or pathological slowdowns.

What would settle it

Running inference on a graph with irregular structure where reordering fails to produce sequential access, resulting in no speedup or incorrect embeddings.

Figures

Figures reproduced from arXiv: 2605.09402 by Pranjal Naman, Yogesh Simmhan.

Figure 1
Figure 1. Time taken (left Y axis; hatched bars: extrapolated, solid bars: completed) and total bytes read from disk (markers, right Y axis), for full-graph inference of a 2-layer SAGEConv with topology and features disk-resident, for Ginex [24], DGI [44], Marius [34], and ATLAS (ours), on Papers [11] and MAG240M-Cites [11], on 4090 and 5090 GPU workstations. view at source ↗
Figure 2
Figure 2. Illustration of gather-based versus broadcast-based execution for one layer. view at source ↗
Figure 3
Figure 3. ATLAS architecture. view at source ↗
Figure 4
Figure 4. Time taken (left Y axis; hatched bars for extrapolated, solid bars for completed runs) and extrapolated data read from disk (markers, right Y axis) for complete execution of Ginex and layer-wise executions of DGI and ATLAS. view at source ↗
Figure 5
Figure 5. Resource utilization for a 2-layer GCN on the IL and MA datasets. view at source ↗
Figure 6
Figure 6. Performance comparison of OG, RND, and AT orderings showing read (reload) and … view at source ↗
Figure 7
Figure 7. Performance comparison of RND, LRU, and AT eviction policies. view at source ↗
Figure 8
Figure 8. Sensitivity of ATLAS to hot-store memory budgets for IL and MA. Top row: execution time (bars, left Y axis) and number of reloads (markers, right Y axis) across memory budgets. Bottom row: number of vertices in cold-store state across chunks. AT reduces reload time (blue bars, left Y axis) by 1.5–2.2× compared to RND/LRU, while eviction time (orange bars, left Y axis) drops by 3–5.6×. view at source ↗
Original abstract

Graph Neural Network (GNN) inference on billion-scale graphs is critical for domains like fintech and recommendation systems. Full-graph inference on these large graphs can be challenging due to high communication costs in distributed settings and high I/O costs in disk-backed Out-of-Core (OOC) settings. Existing OOC systems, operating across disk and memory, primarily focus on GNN training and perform poorly for full-graph inference due to massive read amplification, irregular I/O, and memory pressure. We present ATLAS, a disk-based GNN inference framework that enables efficient full-graph, layer-wise inference on graphs whose topologies, features and intermediate embeddings exceed the available memory on single machines. ATLAS replaces gather-based execution with a broadcast-based model that enables sequential, single-pass streaming reads of features and embeddings per layer. A tiered memory-disk hierarchy with minimum-pending-message eviction, graph reordering and a GPU-accelerated pipeline sustains high throughput within $128$ GiB RAM and $2$ TiB SSD. Across out-of-core graphs with up to $4$B edges and $550$ GiB features and multiple GNN architectures, ATLAS improves end-to-end inference time by $\approx12$--$30\times$ over State-of-the-Art (SOTA) OOC baselines on a single workstation, while remaining within $\approx5\%$ when features fit in memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ATLAS, a disk-based out-of-core (OOC) framework for full-graph layer-wise GNN inference on single workstations. It replaces gather-based execution with a broadcast-based model, combined with graph reordering, a tiered memory-disk hierarchy, minimum-pending-message eviction, and a GPU-accelerated pipeline, to achieve sequential single-pass streaming reads of features and embeddings. The central claim is that this design delivers end-to-end inference speedups of approximately 12-30x over SOTA OOC baselines on graphs with up to 4B edges and 550 GiB features, while staying within 5% of in-memory performance when features fit in RAM.

Significance. If the performance claims and the correctness of the single-pass I/O invariant hold across graph structures and GNN architectures, the work would be significant for practical large-scale GNN deployment in resource-constrained environments. It targets a real bottleneck in recommendation and fintech applications by reducing I/O amplification without distributed infrastructure. The engineering artifact of the broadcast model plus eviction policy is a concrete contribution that could be adopted if shown to be robust.

major comments (2)
  1. [Abstract and the description of the minimum-pending-message eviction policy] The central performance claim (12-30x speedup) rests on the broadcast model plus minimum-pending-message eviction producing exactly one sequential read per feature/embedding block per layer. For arbitrary graphs this requires that every neighbor’s data remains resident when a node is processed; high-degree nodes or power-law degree distributions can cause pending-message counts to exceed the eviction threshold, forcing either extra I/O or incorrect partial aggregation. No section quantifies the worst-case pending-message size or proves that reordering plus the eviction rule is sufficient for all degree sequences and all GNN aggregation functions.
  2. [Abstract and Experiments section] The abstract asserts the approach works for graphs up to 4B edges, yet the experimental section supplies no details on the specific SOTA OOC baselines, the exact graph datasets and their degree distributions, the GNN architectures tested, error bars, or ablation results isolating the contribution of reordering versus the eviction policy. Without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc data selection.
minor comments (2)
  1. [Abstract] The abstract claims 'remaining within ≈5% when features fit in memory' but does not specify the exact in-memory baseline or whether the same code path is used.
  2. [System Design] Notation for the eviction threshold and pending-message count should be defined with a small example in the system-design section to clarify how the policy interacts with high-degree nodes.
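To make the requested example concrete: a hedged sketch of what a minimum-pending-message eviction step could look like, with invented block names and counts (the paper's actual notation and data structures may differ):

```python
def evict_min_pending(resident, pending, k):
    """Evict the k resident blocks with the fewest pending (undelivered)
    messages. pending[b] counts messages still owed to block b; evicting a
    block with pending[b] > 0 forces that block to be reloaded later."""
    victims = sorted(resident, key=lambda b: pending[b])[:k]
    return [b for b in resident if b not in victims], victims

# Tiny worked example: four resident blocks after one streaming step.
pending = {"A": 0, "B": 5, "C": 1, "D": 3}
resident, evicted = evict_min_pending(["A", "B", "C", "D"], pending, 2)
# A and C are evicted (0 and 1 pending messages); B and D stay hot
# because evicting them would cost the most future reloads.
```

A worst-case analysis would then ask how large pending counts can grow for a high-degree vertex before every eviction choice forces a reload, which is exactly the gap major comment 1 raises.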

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, providing clarifications based on the manuscript design and committing to specific revisions where the feedback identifies gaps in analysis or reporting.

Point-by-point responses
  1. Referee: [Abstract and the description of the minimum-pending-message eviction policy] The central performance claim (12-30x speedup) rests on the broadcast model plus minimum-pending-message eviction producing exactly one sequential read per feature/embedding block per layer. For arbitrary graphs this requires that every neighbor’s data remains resident when a node is processed; high-degree nodes or power-law degree distributions can cause pending-message counts to exceed the eviction threshold, forcing either extra I/O or incorrect partial aggregation. No section quantifies the worst-case pending-message size or proves that reordering plus the eviction rule is sufficient for all degree sequences and all GNN aggregation functions.

    Authors: We appreciate the referee identifying this key aspect of the single-pass I/O invariant. ATLAS's broadcast-based layer-wise execution, combined with graph reordering (to process nodes such that message lifetimes are minimized) and the minimum-pending-message eviction policy (which evicts blocks with the fewest unresolved dependencies first), is explicitly designed to maintain the property that each feature and embedding block is read exactly once per layer via sequential streaming. The tiered memory-disk hierarchy and GPU pipeline further support this by keeping pending data resident until aggregation completes. That said, the current manuscript does not include a formal worst-case quantification of pending-message sizes or a proof covering arbitrary degree sequences and all aggregation functions. In revision we will add a new analysis subsection that (a) derives bounds on maximum pending messages under the reordering heuristic for power-law and other distributions, (b) reports empirical maximum pending counts measured on the evaluated graphs, and (c) discusses behavior for common GNN aggregators (sum, mean, attention). This will clarify the conditions under which the single-pass guarantee holds. revision: yes

  2. Referee: [Abstract and Experiments section] The abstract asserts the approach works for graphs up to 4B edges, yet the experimental section supplies no details on the specific SOTA OOC baselines, the exact graph datasets and their degree distributions, the GNN architectures tested, error bars, or ablation results isolating the contribution of reordering versus the eviction policy. Without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc data selection.

    Authors: We agree that the experimental presentation can be strengthened for verifiability. The manuscript already names the SOTA OOC baselines, lists the graph datasets (including those reaching 4B edges and 550 GiB features), and specifies the GNN architectures evaluated. To address the gaps, the revised Experiments section will add: explicit degree-distribution statistics (e.g., max degree, power-law exponent) for each dataset; error bars computed over multiple runs; full baseline implementation details; and dedicated ablation experiments that separately disable reordering and the minimum-pending-message eviction policy while measuring I/O volume and runtime. These changes will allow direct assessment of robustness across graph structures and will isolate the contribution of each technique to the observed 12-30x speedups. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with external benchmarks

full rationale

The ATLAS paper describes an engineering framework for out-of-core GNN inference that replaces gather with broadcast execution, adds graph reordering and minimum-pending-message eviction, and reports measured speedups on real graphs up to 4B edges. No equations, fitted parameters, or first-principles derivations appear; performance claims rest on direct comparison to external SOTA OOC baselines rather than any self-referential reduction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The contribution is therefore self-contained against external empirical evidence and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new free parameters, no ad-hoc axioms beyond standard assumptions of GNN layer-wise computation and POSIX I/O semantics, and no invented entities; it relies on existing graph reordering and storage-hierarchy techniques.

axioms (1)
  • domain assumption GNN inference proceeds layer-wise with gather or broadcast communication patterns
    Invoked in the description of replacing gather-based execution

pith-pipeline@v0.9.0 · 5544 in / 1339 out tokens · 50967 ms · 2026-05-12T03:05:57.412206+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Dsp: Efficient gnn training with multiple gpus

    Zhenkun Cai, Qihui Zhou, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, and George Karypis. Dsp: Efficient gnn training with multiple gpus. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 392–404, 2023

  2. [2]

    A linear time implementation of the reverse Cuthill-McKee algorithm. BIT Numerical Mathematics, 20(1):8–14, 1980

    Wing-Man Chan and Alan George. A linear time implementation of the reverse Cuthill-McKee algorithm. BIT Numerical Mathematics, 20(1):8–14, 1980

  3. [3]

    Deal: distributed end-to-end gnn inference for all nodes. arXiv preprint arXiv:2503.02960, 2025

    Shiyang Chen, Xiang Song, Vasiloudis Theodore, and Hang Liu. Deal: distributed end-to-end gnn inference for all nodes. arXiv preprint arXiv:2503.02960, 2025

  4. [4]

    Billion-scale fintech analytics: Scalable data management and anomaly detection at npci

    Bharadwaj Dasari, Turaga Sai Dhiraj, Ganesh Jambhrunkar, Thirumalai Kailasam, Charu Vikram, Saurav Singla, Pranjal Naman, and Yogesh Simmhan. Billion-scale fintech analytics: Scalable data management and anomaly detection at npci. In IEEE International Conference on Data Engineering (ICDE), 2026

  5. [5]

    Eta prediction with graph neural networks in google maps

    Austin Derrow-Pinion, Jennifer She, David Wong, Oliver Lange, Todd Hester, Luis Perez, Marc Nunkesser, Seongjae Lee, Xueying Guo, Brett Wiltshire, et al. Eta prediction with graph neural networks in google maps. In Proceedings of the 30th ACM international conference on information & knowledge management, pages 3767–3776, 2021

  6. [6]

    Enhancing graph neural network-based fraud detectors against camouflaged fraudsters

    Yingtong Dou, Zhiwei Liu, Li Sun, Yutong Deng, Hao Peng, and Philip S Yu. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In Proceedings of the 29th ACM international conference on information & knowledge management, pages 315–324, 2020

  7. [7]

    Fast Graph Representation Learning with PyTorch Geometric

    Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019

  8. [8]

    P3: Distributed deep graph learning at scale

    Swapnil Gandhi and Anand Padmanabha Iyer. P3: Distributed deep graph learning at scale. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 551–568, 2021

  9. [9]

    Attention based spatial-temporal graph convolutional networks for traffic flow forecasting

    Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 922–929, 2019

  10. [10]

    Inductive representation learning on large graphs

    William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 1025–1035, 2017

  11. [11]

    Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020

    Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020

  12. [12]

    Communication-efficient graph neural networks with probabilistic neighborhood expansion analysis and caching. Proceedings of Machine Learning and Systems, 5:477–494, 2023

    Tim Kaler, Alexandros Iliopoulos, Philip Murzynowski, Tao Schardl, Charles E Leiserson, and Jie Chen. Communication-efficient graph neural networks with probabilistic neighborhood expansion analysis and caching. Proceedings of Machine Learning and Systems, 5:477–494, 2023

  13. [13]

    Accelerating training and inference of graph neural networks with fast sampling and pipelining. Proceedings of Machine Learning and Systems, 4:172–189, 2022

    Tim Kaler, Nickolas Stathas, Anne Ouyang, Alexandros-Stavros Iliopoulos, Tao Schardl, Charles E Leiserson, and Jie Chen. Accelerating training and inference of graph neural networks with fast sampling and pipelining. Proceedings of Machine Learning and Systems, 4:172–189, 2022

  14. [14]

    Igb: Addressing the gaps in labeling, features, heterogeneity, and size of public graph datasets for deep learning research

    Arpandeep Khatua, Vikram Sharma Mailthody, Bhagyashree Taleka, Tengfei Ma, Xiang Song, and Wen-mei Hwu. Igb: Addressing the gaps in labeling, features, heterogeneity, and size of public graph datasets for deep learning research. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4284–4295, 2023

  15. [15]

    Semi-supervised classification with graph convolutional networks

    Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017

  16. [16]

    Diskgnn: Bridging i/o efficiency and model accuracy for out-of-core gnn training. Proceedings of the ACM on Management of Data, 3(1):1–27, 2025

    Renjie Liu, Yichuan Wang, Xiao Yan, Haitian Jiang, Zhenkun Cai, Minjie Wang, Bo Tang, and Jinyang Li. Diskgnn: Bridging i/o efficiency and model accuracy for out-of-core gnn training. Proceedings of the ACM on Management of Data, 3(1):1–27, 2025

  17. [17]

    Pick and choose: a gnn-based imbalanced learning approach for fraud detection

    Yang Liu, Xiang Ao, Zidi Qin, Jianfeng Chi, Jinghua Feng, Hao Yang, and Qing He. Pick and choose: a gnn-based imbalanced learning approach for fraud detection. In Proceedings of the web conference 2021, pages 3168–3177, 2021

  18. [18]

    Ripple++: An incremental framework for efficient gnn inference on evolving graphs. arXiv preprint arXiv:2601.12347, 2026

    Pranjal Naman, Parv Agarwal, Hrishikesh Haritas, and Yogesh Simmhan. Ripple++: An incremental framework for efficient gnn inference on evolving graphs. arXiv preprint arXiv:2601.12347, 2026

  19. [19]

    Optimizing federated learning using remote embeddings for graph neural networks

    Pranjal Naman and Yogesh Simmhan. Optimizing federated learning using remote embeddings for graph neural networks. In European Conference on Parallel Processing, pages 470–484. Springer, 2024

  20. [20]

    A gpu is all you need: Rethinking distributed and out-of-core gnn training

    Pranjal Naman and Yogesh Simmhan. A gpu is all you need: Rethinking distributed and out-of-core gnn training. In 2025 IEEE 32nd International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW), pages 193–194, 2025

  21. [21]

    Ripple: Scalable incremental gnn inferencing on large streaming graphs

    Pranjal Naman and Yogesh Simmhan. Ripple: Scalable incremental gnn inferencing on large streaming graphs. In 2025 IEEE 45th International Conference on Distributed Computing Systems (ICDCS), pages 857–867, 2025

  22. [22]

    Optimes: Optimizing federated learning using remote embeddings for graph neural networks. Journal of Parallel and Distributed Computing, page 105227, 2026

    Pranjal Naman and Yogesh Simmhan. Optimes: Optimizing federated learning using remote embeddings for graph neural networks. Journal of Parallel and Distributed Computing, page 105227, 2026

  23. [23]

    Accelerating sampling and aggregation operations in gnn frameworks with gpu initiated direct storage accesses. Proceedings of the VLDB Endowment, 17(6):1227–1240, 2024

    Jeongmin Brian Park, Vikram Sharma Mailthody, Zaid Qureshi, and Wen-mei Hwu. Accelerating sampling and aggregation operations in gnn frameworks with gpu initiated direct storage accesses. Proceedings of the VLDB Endowment, 17(6):1227–1240, 2024

  24. [24]

    Yeonhong Park, Sunhong Min, and Jae W Lee. Ginex: Ssd-enabled billion-scale graph neural network training on a single machine via provably optimal in-memory caching. Proceedings of the VLDB Endowment, 15(11):2626–2639, 2022

  25. [25]

    Temporal graph networks for deep learning on dynamic graphs

    Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637, 2020

  26. [26]

    X-stream: Edge-centric graph processing using streaming partitions

    Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 472–488, 2013

  27. [27]

    Scaling real-time traffic analytics on edge-cloud fabrics for city-scale camera networks

    Akash Sharma, Pranjal Naman, Roopkatha Banerjee, Priyanshu Pansari, Sankalp Gawali, Mayank Arya, Sharath Chandra, Arun Josephraj, Rakshit Ramesh, Punit Rathore, et al. Scaling real-time traffic analytics on edge-cloud fabrics for city-scale camera networks. In TCSC SCALE Challenge, IEEE CCGRID Workshops, 2026

  28. [28]

    Outre: An out-of-core de-redundancy gnn training framework for massive graphs within a single machine. Proceedings of the VLDB Endowment, 17(11):2960–2973, 2024

    Zeang Sheng, Wentao Zhang, Yangyu Tao, and Bin Cui. Outre: An out-of-core de-redundancy gnn training framework for massive graphs within a single machine. Proceedings of the VLDB Endowment, 17(11):2960–2973, 2024

  29. [29]

    Caliex: A disk-based large-scale gnn training system with joint design of caching and execution

    Can Su, Haipeng Zhang, Hanyu Zhao, Wenting Shen, Baole Ai, Yong Li, Kaigui Bian, and Bin Cui. Caliex: A disk-based large-scale gnn training system with joint design of caching and execution. In 2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 2908–2921. IEEE, 2025

  30. [30]

    Hyperion: Co-optimizing ssd access and gpu computation for cost-efficient gnn training

    Jie Sun, Mo Sun, Zheng Zhang, Zuocheng Shi, Jun Xie, Zihan Yang, Jie Zhang, Zeke Wang, and Fei Wu. Hyperion: Co-optimizing ssd access and gpu computation for cost-efficient gnn training. In 2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 321–335. IEEE, 2025

  31. [31]

    Optimization of gnn training through half-precision

    Arnab Kanti Tarafder, Yidong Gong, and Pradeep Kumar. Optimization of gnn training through half-precision. In Proceedings of the 34th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’25, 2025

  32. [32]

    Graph Attention Networks. International Conference on Learning Representations, 2018

    Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018

  33. [33]

    Lumos: Dependency-driven disk-based graph processing

    Keval Vora. Lumos: Dependency-driven disk-based graph processing. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 429–442, 2019

  34. [34]

    Mariusgnn: Resource-efficient out-of-core training of graph neural networks

    Roger Waleffe, Jason Mohoney, Theodoros Rekatsinas, and Shivaram Venkataraman. Mariusgnn: Resource-efficient out-of-core training of graph neural networks. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 144–161, 2023

  35. [35]

    Deep graph library: Towards efficient and scalable deep learning on graphs

    Minjie Yu Wang. Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds, 2019

  36. [36]

    Inkstream: Instantaneous gnn inference on dynamic graphs via incremental update

    Dan Wu, Zhaoying Li, and Tulika Mitra. Inkstream: Instantaneous gnn inference on dynamic graphs via incremental update. In 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1273–1285. IEEE, 2025

  37. [37]

    Simplifying graph convolutional networks

    Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2019

  38. [38]

    Redundancy-free high-performance dynamic gnn training with hierarchical pipeline parallelism

    Yaqi Xia, Zheng Zhang, Hulin Wang, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. Redundancy-free high-performance dynamic gnn training with hierarchical pipeline parallelism. In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’23, 2023

  39. [39]

    Capsule: an out-of-core training mechanism for colossal gnns. Proceedings of the ACM on Management of Data, 3(1):1–30, 2025

    Yongan Xiang, Zezhong Ding, Rui Guo, Shangyou Wang, Xike Xie, and S Kevin Zhou. Capsule: an out-of-core training mechanism for colossal gnns. Proceedings of the ACM on Management of Data, 3(1):1–30, 2025

  40. [40]

    How powerful are graph neural networks?

    Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019

  41. [41]

    A hybrid update strategy for i/o-efficient out-of-core graph processing. IEEE Transactions on Parallel and Distributed Systems, 31(8):1767–1782, 2020

    Xianghao Xu, Fang Wang, Hong Jiang, Yongli Cheng, Dan Feng, and Yongxuan Zhang. A hybrid update strategy for i/o-efficient out-of-core graph processing. IEEE Transactions on Parallel and Distributed Systems, 31(8):1767–1782, 2020

  42. [42]

    Aligraph: A comprehensive graph neural network platform

    Hongxia Yang. Aligraph: A comprehensive graph neural network platform. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3165–3166, 2019

  43. [43]

    Consisrec: Enhancing gnn for social recommendation via consistent neighbor aggregation

    Liangwei Yang, Zhiwei Liu, Yingtong Dou, Jing Ma, and Philip S Yu. Consisrec: Enhancing gnn for social recommendation via consistent neighbor aggregation. In Proceedings of the 44th international ACM SIGIR conference on Research and development in information retrieval, pages 2141–2145, 2021

  44. [44]

    Dgi: An easy and efficient framework for gnn model evaluation

    Peiqi Yin, Xiao Yan, Jinjing Zhou, Qiang Fu, Zhenkun Cai, James Cheng, Bo Tang, and Minjie Wang. Dgi: An easy and efficient framework for gnn model evaluation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5439–5450, 2023

  45. [45]

    Graph convolutional neural networks for web-scale recommender systems

    Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 974–983, 2018

  46. [46]

    Inferturbo: A scalable system for boosting full-graph inference of graph neural network over huge graphs

    Dalong Zhang, Xianzheng Song, Zhiyang Hu, Yang Li, Miao Tao, Binbin Hu, Lin Wang, Zhiqiang Zhang, and Jun Zhou. Inferturbo: A scalable system for boosting full-graph inference of graph neural network over huge graphs. In 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 3235–3247. IEEE, 2023

  47. [47]

    Distdgl: Distributed graph neural network training for billion-scale graphs

    Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. Distdgl: Distributed graph neural network training for billion-scale graphs. In 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pages 36–44. IEEE, 2020