On the Safety of Graph Representation Learning
Pith reviewed 2026-05-08 12:13 UTC · model grok-4.3
The pith
Safety in graph representation learning depends on how a method's design interacts with the specific graph stress rather than on which family the method belongs to.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce GRL-Safety, a multi-axis safety evaluation benchmark for graph representation learning. It evaluates twelve methods spanning topology-only embeddings, supervised GNNs, self-supervised models, and graph foundation models on twenty-five datasets under standardized conditions that preserve each method's native adaptation. The benchmark reports per-axis and per-sub-condition results on corruption robustness, out-of-distribution generalization, class imbalance, fairness, and interpretation. Analysis of the outcomes yields three cross-axis insights: safety behavior is shaped by the interaction between representation design and the stressed graph factor rather than by method family alone; foundation-era methods show axis-specific strengths rather than broad safety dominance; and several deployment regimes remain difficult even for the best evaluated method, revealing capability gaps that call for new robustness, adaptation, or training objectives beyond model selection.
What carries the argument
The GRL-Safety benchmark, which applies five safety axes (corruption robustness, OOD generalization, class imbalance, fairness, interpretation) to twelve methods on twenty-five datasets while preserving method-native adaptation and reporting disaggregated scores.
If this is right
- Method selection for deployment must account for the expected graph stresses rather than defaulting to a single family.
- Graph foundation models exhibit strengths on particular axes but do not dominate across all of them.
- Several regimes, including certain corruption and fairness settings, remain difficult for all evaluated methods and therefore require new robustness or adaptation techniques.
- Future evaluations should continue to report per-axis and per-sub-condition results instead of single aggregate scores.
- Representation design choices can be tuned to anticipated stresses once the interaction pattern is understood.
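The last two points can be made concrete with a toy example (numbers invented for illustration, not taken from the benchmark): two methods with identical aggregate scores can hide opposite weaknesses that only per-axis reporting reveals.

```python
# Toy illustration (invented scores): a single aggregate hides axis-level differences.
scores = {
    "method_A": {"corruption": 0.9, "ood": 0.5, "imbalance": 0.7},
    "method_B": {"corruption": 0.5, "ood": 0.9, "imbalance": 0.7},
}

def aggregate(per_axis):
    """Mean over axes -- the single summary score the benchmark argues against."""
    return sum(per_axis.values()) / len(per_axis)

# Both methods aggregate to 0.7, yet their weakest axes differ.
assert aggregate(scores["method_A"]) == aggregate(scores["method_B"])
worst_axis = {m: min(s, key=s.get) for m, s in scores.items()}
print(worst_axis)  # -> {'method_A': 'ood', 'method_B': 'corruption'}
```

Under an aggregate-only report, method_A and method_B are interchangeable; the per-axis view shows they fail under different deployment stresses.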
Where Pith is reading between the lines
- Designers could build stress-aware selection tools that map expected deployment factors to the representation features shown to handle them best.
- The same multi-axis testing approach could be applied to other representation-learning domains such as images or sequences to check whether similar design-stress interactions appear.
- Training pipelines might incorporate explicit objectives that encourage robustness across multiple axes simultaneously rather than optimizing for one at a time.
Load-bearing premise
The twenty-five chosen datasets and five safety axes, run under standardized conditions, are sufficient to capture the deployment stresses that matter for graph representation methods.
What would settle it
A new or existing GRL method that ranks first on every safety axis and every sub-condition without any axis-specific tuning or additional objectives.
Original abstract
Graph representation learning (GRL) has evolved from topology-only graph embeddings to task-specific supervised GNNs, and more recently to reusable representations and graph foundation models (GFMs). However, existing evaluations mainly measure clean transfer, adaptation, and task coverage. It remains unclear whether GRL methods stay reliable when deployment stresses affect graph signals, graph contexts, label support, structural groups, or predictive evidence. We introduce GRL-Safety, a multi-axis safety evaluation benchmark for GRL. GRL-Safety evaluates twelve representative methods, spanning topology-only embedding methods, supervised GNNs, self-supervised graph models, and GFMs, on twenty-five graph datasets under standardized evaluation conditions while preserving method-native adaptation. The evaluation covers five safety axes: corruption robustness, OOD generalization, class imbalance, fairness, and interpretation, with per-axis and sub-condition reporting rather than a single aggregate score. Our analysis yields three cross-axis insights that can inspire future research. First, safety behavior is shaped by the interaction between representation design and the stressed graph factor, rather than by method family alone. Second, foundation-era methods show axis-specific strengths rather than broad safety dominance. Third, several deployment regimes remain difficult even for the best evaluated method, revealing capability gaps that require new robustness, adaptation, or training objectives beyond model selection. The benchmark, evaluation protocols, and code are available at: https://github.com/GXG-CS/GRL-Safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GRL-Safety, a multi-axis benchmark that evaluates twelve GRL methods (topology-only embeddings, supervised GNNs, self-supervised models, and GFMs) on twenty-five datasets under standardized conditions that preserve method-native adaptation. It reports results across five safety axes—corruption robustness, OOD generalization, class imbalance, fairness, and interpretation—with per-axis and sub-condition breakdowns rather than aggregate scores. The analysis yields three cross-axis insights: safety behavior arises from representation-design × stressed-factor interactions rather than method family; foundation-era methods exhibit axis-specific strengths; and several deployment regimes remain difficult for all evaluated methods.
Significance. If the evaluation protocol is free of selection confounds, the work supplies a reproducible, multi-axis safety lens on GRL that moves beyond clean-task performance. The public release of the benchmark, protocols, and code is a clear strength that supports follow-on research. The interaction insight, if substantiated, would shift focus from family-level comparisons toward design-factor pairings and would usefully identify concrete capability gaps.
Major comments (3)
- [Abstract and §4.3, Cross-axis Insights] The central claim that 'safety behavior is shaped by the interaction between representation design and the stressed graph factor, rather than by method family alone' is load-bearing. The manuscript does not report a quantitative decomposition (e.g., variance partitioning or interaction-term significance in a mixed-effects model) that isolates design choices (message-passing vs. contrastive objectives, positional encodings) from family-level effects while controlling for dataset domain. Without this, the observed patterns could still be driven by correlations between the 25 datasets and the method families evaluated on them.
- [§3.1 and Table 1] The claim that the twenty-five datasets and five axes 'sufficiently capture the relevant deployment stresses' is not supported by an explicit diversity analysis. No table or appendix quantifies domain coverage (citation, social, molecular, etc.), stress orthogonality (e.g., correlation between corruption type and method family), or balance across graph properties that could interact with the evaluated designs. This leaves the interaction insight vulnerable to selection bias.
- [§3.2, Evaluation Protocol] The description of 'standardized evaluation conditions while preserving method-native adaptation' is insufficiently detailed for reproducibility of the fairness and interpretation axes. For instance, it is unclear how contrastive or self-supervised objectives are adapted to fairness constraints without introducing protocol-level bias that could favor or disfavor particular families, undermining cross-family comparisons.
Minor comments (3)
- [Figure 3 and §4.2] The per-sub-condition plots are informative but dense; a compact summary heatmap aggregating the five axes would improve readability without loss of information.
- [Related Work] Prior safety benchmarks in graph learning (e.g., robustness surveys or fairness toolkits) are cited, but the positioning relative to non-graph safety evaluation frameworks (e.g., those for language or vision models) could be sharpened to clarify the unique contribution of GRL-Safety.
- [Appendix B] The list of hyperparameters and adaptation choices for each of the twelve methods is welcome, but a single consolidated table mapping each method to its exact adaptation procedure per axis would reduce cross-referencing.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying the empirical basis for our claims and committing to revisions that strengthen the manuscript without altering its core contributions.
Point-by-point responses
Referee: [Abstract and §4.3] The central claim that 'safety behavior is shaped by the interaction between representation design and the stressed graph factor, rather than by method family alone' is load-bearing. The manuscript does not report a quantitative decomposition (e.g., variance partitioning or interaction-term significance in a mixed-effects model) that isolates design choices (message-passing vs. contrastive objectives, positional encodings) from family-level effects while controlling for dataset domain. Without this, the observed patterns could still be driven by correlations between the 25 datasets and the method families evaluated on them.
Authors: The claim rests on systematic per-subcondition breakdowns across 25 datasets that reveal design-specific patterns (e.g., contrastive objectives outperforming message-passing under certain OOD stresses, independent of family). While we did not include a formal mixed-effects decomposition, the consistency of these patterns across domains supports the interaction view. In revision we will add a post-hoc variance partitioning analysis using mixed-effects models on the safety metrics, treating design choices and family as factors while controlling for dataset domain and properties. revision: yes
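The promised decomposition can be sketched in miniature. Below is a hedged, toy-data illustration (nothing here comes from the paper) of an eta-squared-style variance partition: the scores are constructed so that the design × stressed-factor interaction explains all of the variance while method family explains none, which is exactly the pattern the interaction claim predicts.

```python
# Toy variance partition (invented data): how much of the score variance is
# explained by grouping on method family vs. on the design x stress interaction.
from collections import defaultdict

def eta_squared(records, key):
    """Fraction of total score variance explained by grouping records on key(r)."""
    scores = [r["score"] for r in records]
    grand = sum(scores) / len(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r["score"])
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total

# Scores depend only on (design, stress); both families behave identically.
records = [
    {"family": "GNN", "design": "message_passing", "stress": "corruption", "score": 0.9},
    {"family": "GNN", "design": "contrastive",     "stress": "corruption", "score": 0.5},
    {"family": "SSL", "design": "message_passing", "stress": "corruption", "score": 0.9},
    {"family": "SSL", "design": "contrastive",     "stress": "corruption", "score": 0.5},
    {"family": "GNN", "design": "message_passing", "stress": "ood",        "score": 0.4},
    {"family": "GNN", "design": "contrastive",     "stress": "ood",        "score": 0.8},
    {"family": "SSL", "design": "message_passing", "stress": "ood",        "score": 0.4},
    {"family": "SSL", "design": "contrastive",     "stress": "ood",        "score": 0.8},
]

by_family = eta_squared(records, key=lambda r: r["family"])                      # ~0.0
by_interaction = eta_squared(records, key=lambda r: (r["design"], r["stress"]))  # ~1.0
```

A mixed-effects model as the referee requests would add significance tests and random dataset effects on top of this partition; the sketch only shows the quantity being partitioned.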
Referee: [§3.1 and Table 1] The claim that the twenty-five datasets and five axes 'sufficiently capture the relevant deployment stresses' is not supported by an explicit diversity analysis. No table or appendix quantifies domain coverage (citation, social, molecular, etc.), stress orthogonality (e.g., correlation between corruption type and method family), or balance across graph properties that could interact with the evaluated designs. This leaves the interaction insight vulnerable to selection bias.
Authors: Table 1 and the appendix already enumerate the datasets by domain and stress type, chosen to span citation, social, molecular, and other graphs with varying sizes and densities. To directly address potential selection bias, we will add an appendix with quantitative diversity metrics: domain distribution, pairwise correlations between stress types and method families, and balance statistics on key graph properties (e.g., degree distribution, homophily). revision: yes
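As a minimal example of the kind of graph-property statistic such an appendix would report, edge homophily can be computed in a few lines (toy graph for illustration, not a benchmark dataset):

```python
# Toy sketch: edge homophily = fraction of edges joining same-label nodes,
# one of the balance statistics the revision promises to tabulate per dataset.
def edge_homophily(edges, labels):
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

labels = {0: "A", 1: "A", 2: "B", 3: "B"}
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]  # two intra-class, two cross-class edges
print(edge_homophily(edges, labels))  # -> 0.5
```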
Referee: [§3.2] The description of 'standardized evaluation conditions while preserving method-native adaptation' is insufficiently detailed for reproducibility of the fairness and interpretation axes. For instance, it is unclear how contrastive or self-supervised objectives are adapted to fairness constraints without introducing protocol-level bias that could favor or disfavor particular families, undermining cross-family comparisons.
Authors: Adaptations for fairness (e.g., group-aware reweighting or adversarial objectives for contrastive losses) and interpretation (gradient- or attention-based methods) are implemented in the released code and described at high level in the supplement. We agree the main-text description in §3.2 can be expanded. In revision we will add explicit step-by-step protocol details for each method family and axis, including how native objectives are preserved while incorporating fairness constraints, to eliminate any ambiguity. revision: yes
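For concreteness, one common form of group-aware reweighting can be sketched as follows. This is an illustrative reading under assumptions, not the paper's released implementation: each node's loss is scaled by the inverse frequency of its sensitive group, leaving the method-native objective otherwise untouched.

```python
# Sketch (hypothetical interface): inverse-frequency reweighting by sensitive group.
from collections import Counter

def group_weights(groups):
    """Inverse-frequency weight per sensitive group, normalized so the
    average weight over groups is 1 (uniform losses stay unchanged)."""
    counts = Counter(groups)
    return {g: len(groups) / (len(counts) * c) for g, c in counts.items()}

def reweighted_loss(losses, groups):
    """Mean of per-node losses, each scaled by its group's weight."""
    w = group_weights(groups)
    return sum(l * w[g] for l, g in zip(losses, groups)) / len(losses)

groups = ["maj", "maj", "maj", "min"]  # 3:1 group imbalance
losses = [1.0, 1.0, 1.0, 1.0]
print(reweighted_loss(losses, groups))  # ≈ 1.0: uniform losses are unaffected,
# but any excess loss on the minority group is now amplified 3x relative to majority.
```

Because the reweighting acts only on the loss aggregation, the same wrapper applies to supervised, contrastive, and masked objectives without altering their native form, which is the property the standardized protocol needs.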
Circularity Check
Empirical benchmark with no circular derivation chain
Full rationale
The paper introduces GRL-Safety as an empirical multi-axis benchmark, evaluates 12 methods on 25 datasets across 5 safety axes under standardized conditions, and draws three cross-axis insights directly from the resulting performance patterns. No mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or described structure; the claims rest on experimental outcomes with public code and datasets rather than any self-referential reduction. This is a standard empirical evaluation paper whose central results are independently falsifiable.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The five safety axes (corruption robustness, OOD generalization, class imbalance, fairness, interpretation) cover the primary deployment stresses for GRL.
Reference graph
Works this paper leans on
- [1] Chirag Agarwal, Owen Queen, Himabindu Lakkaraju, and Marinka Zitnik. Evaluating explainability for graph neural networks. Scientific Data, 10(1):144, 2023.
- [2] Kenza Amara, Rex Ying, Zitao Zhang, Zhihao Han, Yinan Shan, Ulrik Brandes, Sebastian Schemm, and Ce Zhang. GraphFramEx: Towards systematic evaluation of explainability methods for graph neural networks. In Learning on Graphs Conference (LoG), 2022.
- [3] Kenza Amara, Mennatallah El-Assady, and Rex Ying. GInX-Eval: Towards in-distribution evaluation of graph neural network explanations. In International Conference on Learning Representations (ICLR), 2024.
- [4] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- [5] Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, and Zhangyang Wang. LLaGA: Large language and graph assistant. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- [6] Zhikai Chen, Haitao Mao, Jingzhe Liu, Yu Song, Bingheng Li, Wei Jin, Bahare Fatemi, Anton Tsitsulin, Bryan Perozzi, Hui Liu, and Jiliang Tang. Text-space graph foundation models: Comprehensive benchmarks and new insights. arXiv preprint arXiv:2406.10727, 2024.
- [7] Enyan Dai, Tianxiang Zhao, Huaisheng Zhu, Junjie Xu, Zhimeng Guo, Hui Liu, Jiliang Tang, and Suhang Wang. A comprehensive survey on trustworthy graph neural networks: Privacy, robustness, fairness, and explainability. Machine Intelligence Research, 21(6):1011–1061, 2024.
- [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
- [9] Yushun Dong, Song Wang, Zhenyu Lei, Zaiyi Zheng, Jing Ma, Chen Chen, and Jundong Li. A benchmark for fairness-aware graph learning. arXiv preprint arXiv:2407.12112, 2024.
- [10] Yushun Dong, Zhenyu Wang, Zaiyi Liu, Han Zhao, Suhang Wang, and Jundong Li. PyGDebias: A python library for debiasing in graph learning. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM), 2024.
- [11] Yingtong Dou, Zhiwei Liu, Li Sun, Yutong Deng, Hao Peng, and Philip S. Yu. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM), 2020.
- [12] Taoran Fang, Yunchao Zhang, Yang Yang, Chunping Wang, and Lei Chen. Universal prompt tuning for graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [13] Jiarui Feng, Hao Liu, Lecheng Kong, Mingfang Zhu, Yixin Chen, and Muhan Zhang. TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models, 2024.
- [14] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 855–864, 2016.
- [15] Shurui Gui, Xiner Li, Limei Wang, and Shuiwang Ji. GOOD: A graph out-of-distribution benchmark. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022.
- [16] Xiaoguang Guo, Zehong Wang, Jiazheng Li, Shawn Spitzel, Qi Yang, Kaize Ding, Jundong Li, and Chuxu Zhang. Generalizing GNNs with tokenized mixture of experts. arXiv preprint arXiv:2602.09258, 2026.
- [17] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [18] Yufei He, Yuan Sui, Xiaoxin He, Yue Liu, Yifei Sun, and Bryan Hooi. UniGraph2: Learning a unified embedding space to bind multimodal graphs. In Proceedings of the ACM Web Conference (WWW), 2025.
- [19] Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. GraphMAE: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2022.
- [20] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [21] Qian Huang, Hongyu Ren, Peng Chen, Gregor Kržmanc, Daniel Zeng, Percy Liang, and Jure Leskovec. PRODIGY: Enabling in-context learning over graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [22] Shenyang Huang, Farimah Poursafaei, Jacob Danovitch, Matthias Fey, Weihua Hu, Emanuele Rossi, Jure Leskovec, Michael Bronstein, Guillaume Rabusseau, and Reihaneh Rabbany. Temporal graph benchmark for machine learning on temporal graphs. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023.
- [23] Mingxuan Ju, Tong Zhao, Qianlong Wen, Wenhao Yu, Neil Shah, Yanfang Ye, and Chuxu Zhang. Multi-task self-supervised graph neural networks enable stronger task generalization. In International Conference on Learning Representations (ICLR), 2023.
- [24] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
- [25] Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. DeepRobust: A platform for adversarial attacks and defenses. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Demonstration Track, 2021.
- [26] Yuhan Li, Peisong Wang, Zhixun Li, Jeffrey Xu Yu, and Jia Li. ZeroG: Investigating cross-dataset zero-shot transferability in graphs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2024.
- [27] Ziming Li, Youhuan Li, Xinhuan Chen, Lei Zou, Yang Li, Xiaofeng Yang, and Hongbo Jiang. Newsp: A new search process for continuous subgraph matching over dynamic graphs. In IEEE International Conference on Data Engineering (ICDE), 2024.
- [28] Ziming Li, Youhuan Li, Yuyu Luo, Guoliang Li, and Chuxu Zhang. Graph neural networks for databases: A survey. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2025.
- [29] Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, and Chuxu Zhang. Graph is a substrate across data modalities. arXiv preprint arXiv:2601.22384, 2026.
- [30] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023.
- [31] Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang. One for all: Towards training one graph model for all classification tasks. In International Conference on Learning Representations (ICLR), 2024.
- [32] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- [33] Yihong Ma, Yijun Tian, Nuno Moniz, and Nitesh V. Chawla. Class-imbalanced learning on graphs: A survey. ACM Computing Surveys, 57(8), 2025. doi: 10.1145/3718734.
- [34] Joonhyung Park, Jaeyun Song, and Eunho Yang. GraphENS: Neighbor-aware ego network synthesis for class-imbalanced node classification. In International Conference on Learning Representations (ICLR), 2022.
- [35] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 701–710, 2014.
- [36] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020.
- [37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
- [38] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3982–3992, 2019.
- [39] Leonardo F. R. Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 385–394, 2017.
- [40] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. DropEdge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations (ICLR), 2020.
- [41] Jonathan M. Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M. Donghia, Craig R. MacNair, Shawn French, Lindsey A. Carfrae, Zohar Bloom-Ackermann, et al. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702.e13, 2020. doi: 10.1016/j.cell.2020.01.021.
- [42] Mingchen Sun, Kaixiong Zhou, Xin He, Ying Wang, and Xin Wang. GPPT: Graph pre-training and prompt tuning to generalize graph neural networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2022.
- [43] Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. GraphGPT: Graph instruction tuning for large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024.
- [44] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW), pages 1067–1077, 2015.
- [45] Xianfeng Tang, Huaxiu Yao, Yiwei Sun, Yiqi Wang, Jiliang Tang, Charu Aggarwal, Prasenjit Mitra, and Suhang Wang. Investigating and mitigating degree-related biases in graph convolutional networks. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM), pages 1435–1444, 2020.
- [46] Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L. Dyer, Rémi Munos, Petar Veličković, and Michal Valko. Large-scale representation learning on graphs via bootstrapping. In International Conference on Learning Representations (ICLR), 2022.
- [47] Yijun Tian, Kaiwen Dong, Chunhui Zhang, Chuxu Zhang, and Nitesh V. Chawla. Heterogeneous graph masked autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 37, pages 9997–10005, 2023.
- [48] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
- [49] Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations (ICLR), 2019.
- [50] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023.
- [51] Song Wang, Zhen Tan, Yaochen Zhu, Chuxu Zhang, and Jundong Li. Generative risk minimization for out-of-distribution generalization on graphs. Transactions on Machine Learning Research, 2025. arXiv:2502.07968.
- [52] Song Wang, Yushun Dong, Binchi Zhang, Zihan Chen, Xingbo Fu, Yinhan He, Cong Shen, Chuxu Zhang, Nitesh V. Chawla, and Jundong Li. Safety in graph machine learning: Threats and safeguards. IEEE Transactions on Knowledge and Data Engineering, 2026.
- [53] Zehong Wang, Zheyuan Zhang, Nitesh V. Chawla, Chuxu Zhang, and Yanfang Ye. GFT: Graph foundation model with transferable tree vocabulary. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
-
[54]
Graph foundation models: A comprehensive survey,
Zehong Wang, Zheyuan Liu, Tianyi Ma, Jiazheng Li, Zheyuan Zhang, Xingbo Fu, Yiyang Li, Zhengqing Yuan, Wei Song, Yijun Ma, Qingkai Zeng, Xiusi Chen, Jianan Zhao, Jundong Li, Meng Jiang, Pietro Lio, Nitesh Chawla, Chuxu Zhang, and Yanfang Ye. Graph foundation models: A comprehensive survey.arXiv preprint arXiv:2505.15116, 2025
-
[55]
Chawla, Chuxu Zhang, and Yanfang Ye
Zehong Wang, Zheyuan Zhang, Tianyi Ma, Nitesh V . Chawla, Chuxu Zhang, and Yanfang Ye. Towards graph foundation models: Learning generalities across graphs via task-trees. In International Conference on Machine Learning (ICML), 2025
2025
-
[56]
Beyond message passing: Neural graph pattern machine
Zehong Wang, Zheyuan Zhang, Tianyi Ma, Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye. Beyond message passing: Neural graph pattern machine. InInternational Conference on Machine Learning, pages 65496–65517. PMLR, 2025
2025
-
[57]
Generative graph pattern machine
Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, and Yanfang Ye. Generative graph pattern machine. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[58]
NoisyGL: A comprehensive benchmark for graph neural networks under label noise
Zhonghao Wang, Danyu Sun, Sheng Zhou, Haobo Wang, Jiapei Fan, Longtao Huang, and Jiajun Bu. NoisyGL: A comprehensive benchmark for graph neural networks under label noise. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024
2024
-
[59]
A comprehensive survey on graph neural networks.IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks.IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020
2020
- [60]
-
[61]
GraphFM: A comprehensive benchmark for graph foundation model.arXiv preprint arXiv:2406.08310, 2024
Yuhao Xu, Xinqi Liu, Keyu Duan, Yi Fang, Yu-Neng Chuang, Daochen Zha, and Qiaoyu Tan. GraphFM: A comprehensive benchmark for graph foundation model.arXiv preprint arXiv:2406.08310, 2024
-
[62]
Hamilton, and Jure Leskovec
Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018. 14
2018
-
[63]
Graph contrastive learning with augmentations
Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. InAdvances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[64]
Xingtong Yu, Shenghua Ye, Ruijuan Liang, Chang Zhou, Hong Cheng, Xinming Zhang, and Yuan Fang. Evaluating progress in graph foundation models: A comprehensive benchmark and new insights.arXiv preprint arXiv:2603.10033, 2026
[65] Xiangchi Yuan, Yijun Tian, Chunhui Zhang, Yanfang Ye, Nitesh V Chawla, and Chuxu Zhang. Graph cross supervised learning via generalized knowledge. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4083–4094, 2024. doi: 10.1145/3637528.3671830.
[66] Xiangchi Yuan, Chunhui Zhang, Yijun Tian, Yanfang Ye, and Chuxu Zhang. Mitigating emergent robustness degradation while scaling graph learning. In International Conference on Learning Representations, 2024.
[67] Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 793–803, 2019.
[68] Qihai Zhang, Xinyue Sheng, Yuanfu Sun, and Qiaoyu Tan. TrustGLM: Evaluating the robustness of graph LLMs against prompt, text, and structure attacks. arXiv preprint arXiv:2506.11844, 2025.
[69] Tianxiang Zhao, Xiang Zhang, and Suhang Wang. GraphSMOTE: Imbalanced node classification on graphs with graph neural networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM), 2021.
[70] Qinkai Zheng, Xu Zou, Yuxiao Dong, Yukuo Cen, Da Yin, Jiarong Xu, Yang Yang, and Jie Tang. Graph robustness benchmark: Benchmarking the adversarial robustness of graph machine learning. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.

A Datasets

Dataset statistics. We use twenty-five text-attributed grap...
- For BGRL [46]: num_layers=2, num_hidden=768, lr=1e-4, weight_decay=1e-5, max_epoch=50, batch_size=1024, ema_tau=0.99, warmup_frac=0.1, drop_edge=0.2, drop_feat=0.2, predictor_hidden=512, optimizer=AdamW
- For GraphMAE [19]: num_layers=2, num_hidden=768, num_heads=4, lr=1e-3, weight_decay=2e-4, max_epoch=50, batch_size=1024, mask_rate=0.5, replace_rate=0.0, loss=sce, alpha_l=3.0, in_drop=0.2, attn_drop=0.1, optimizer=AdamW
- For GFT [53]: num_layers=2, hidden_dim=768, dropout=0.15, pretrain_lr=1e-4, pretrain_weight_decay=1e-5, pretrain_epochs=50, pretrain_batch_size=1024, codebook_size=128, codebook_head=4, codebook_decay=0.8, feat_p=0.2, edge_p=0.2, topo_recon_ratio=0.1, optimizer=AdamW
- For GIT [55]: num_layers=2, hidden_dim=768, dropout=0.15, lr=1e-7, weight_decay=1e-8, epochs=20, batch_size=4096, fanout=10, ema=0.99, feat_p=0.2, edge_p=0.2, topo_recon_ratio=0.1, optimizer=AdamW
- For UniGraph2 [18]: num_layers=3, num_hidden=768, num_experts=8, num_selected_experts=2, lr=1e-4, weight_decay=1e-5, max_epoch=50, batch_size=64, gamma=2.0, lambda_spd=0.5, feat_drop_rate=0.1, edge_mask_rate=0.1, fanout=10, optimizer=AdamW
- For OFA [31]: num_layers=7, num_hidden=768, dropout=0.15, lr=1e-3, num_epochs=50, batch_size=128, num_relations=5, JK=none, optimizer=AdamW
C.2 Downstream adaptation hyperparameters

For each method-dataset cell we apply the method's native downstream adaptation procedure end-to-end with AdamW and five random seeds.
- For LLaGA [5], following the original two-stage recipe, only the projector is trained with the Vicuna-7B language model frozen: llm_backbone=Vicuna-7B (frozen), projector_type=2-layer-mlp, projector_hidden=4096, lr=5e-4, weight_decay=1e-4, epochs=10, batch_size=4, sample_size=10, use_hop=2, max_text_len=128, max_steps=5000, dtype=bfloat16, optimizer=Adam...
- For DeepWalk [35], walk embeddings are fitted on the downstream graph with Skip-gram on uniform random walks (p=q=1): emb_dim=128, walk_length=20, walks_per_node=10, context_size=10, num_negative=1, walk_epochs=100, batch_size=128, lr=1e-2, optimizer=SparseAdam. Predictions use a frozen-feature ℓ2-regularised logistic-regression head (scikit-learn, C=1.0, max_iter=1000, solver=lbfgs).
- For Node2vec [14]: same Skip-gram and logistic-regression configuration as DeepWalk, with biased second-order walks (p=1.0, q=0.5, DFS-leaning, since q<1 biases walks outward).
- For GCN, GAT, and GraphSAGE: num_layers=2, num_hidden=768, lr=1e-3, dropout=0.2, weight_decay=1e-4, max_epochs=500, patience=200.
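The adaptation protocol above combines three ingredients: per-seed training, early stopping on validation score with a patience budget, and aggregation over five seeds. A minimal sketch of that loop, with the training step stubbed out (the `run_one_seed` body and its score trace are illustrative placeholders, not the paper's code):

```python
# Hypothetical sketch of the C.2 protocol: five random seeds, early
# stopping with patience on the validation score, mean +/- std report.
# The per-epoch training step is a stand-in, not a real GNN update.
import random
import statistics

MAX_EPOCHS = 500   # from the GCN/GAT/GraphSAGE configuration above
PATIENCE = 200     # stop after this many epochs without improvement
NUM_SEEDS = 5

def run_one_seed(seed: int) -> float:
    """Train with early stopping; return the best validation score."""
    rng = random.Random(seed)
    best, best_epoch = float("-inf"), 0
    for epoch in range(MAX_EPOCHS):
        # Stand-in for one epoch of AdamW training plus validation.
        val_score = 1.0 - 1.0 / (epoch + 1) + rng.gauss(0, 0.01)
        if val_score > best:
            best, best_epoch = val_score, epoch
        elif epoch - best_epoch >= PATIENCE:
            break  # patience exhausted: no improvement in PATIENCE epochs
    return best

scores = [run_one_seed(s) for s in range(NUM_SEEDS)]
mean, std = statistics.mean(scores), statistics.stdev(scores)
print(f"{mean:.3f} +/- {std:.3f} over {NUM_SEEDS} seeds")
```

Reporting mean and standard deviation over seeds, rather than a single run, is what makes the per-axis scores in the benchmark comparable across methods with different training variance.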