When Design Rules Break: Benchmark Composition Determines Whether Label Informativeness Predicts GNN Aggregator Choice

Neha Sharma; Ritesh Sharma

arxiv: 2606.10249 · v1 · pith:TUSPXJNXnew · submitted 2026-06-08 · 💻 cs.LG · cs.SI

Pith reviewed 2026-06-27 16:52 UTC · model grok-4.3

classification 💻 cs.LG cs.SI

keywords GNN aggregator choice, label informativeness, benchmark composition, Facebook-100, sum vs mean aggregation, spectral gap, stochastic block model, node classification

The pith

Benchmark composition determines whether label informativeness predicts GNN aggregator choice

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the rule that label informativeness guides selection between sum and mean aggregation in GNNs holds across benchmark families. The rule works on citation and heterophilic graphs but breaks when dense Facebook-100 friendship networks are added, where sum aggregation is strongly preferred despite near-zero informativeness and yields 7-10% gains. Stochastic block model ablations that match degree scales fail to reproduce the pattern, while the spectral gap distinguishes the Facebook-100 graphs with the effect localized to one-hop neighborhoods. A sympathetic reader would care because apparent GNN design rules may reflect the particular mix of datasets rather than universal properties of graphs or learning.

Core claim

Label informativeness predicts the GIN-Sum versus GIN-Mean performance gap well on legacy benchmarks but degrades substantially when Facebook-100 graphs are included. In these dense friendship networks, near-zero label informativeness coexists with a strong preference for sum aggregation, producing gains of 7-10% and up to 13% under extended training. Stochastic block model ablations, including degree-corrected variants, fail to reproduce this behavior, indicating that mean degree alone does not explain the effect. Among several label-independent graph statistics, the spectral gap uniquely distinguishes these graphs from other low-informativeness datasets, with the effect localized to one-hop neighborhoods and replicated across architectures.
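For readers unfamiliar with the statistic, the following is a minimal sketch of one common edge-based definition of label informativeness: the mutual information between the labels at the two ends of a uniformly sampled edge, normalized by the label entropy. This page does not state which variant the paper uses, so the definition, the function name `label_informativeness`, and the toy graphs are assumptions for illustration only.

```python
import math
from collections import Counter

def label_informativeness(edges, labels):
    """Sketch of edge label informativeness: I(y_u; y_v) / H(y_u) over edge endpoints."""
    joint = Counter()
    for u, v in edges:
        # Count each undirected edge in both directions so the joint is symmetric.
        joint[(labels[u], labels[v])] += 1
        joint[(labels[v], labels[u])] += 1
    total = sum(joint.values())

    # Marginal label distribution of an edge endpoint (degree-weighted by construction).
    marginal = Counter()
    for (a, _), n in joint.items():
        marginal[a] += n

    entropy = -sum((n / total) * math.log(n / total) for n in marginal.values())
    if entropy == 0.0:
        return 0.0  # single-class graph: a convention chosen for this sketch

    mutual = sum(
        (n / total) * math.log((n / total) / ((marginal[a] / total) * (marginal[b] / total)))
        for (a, b), n in joint.items()
    )
    return mutual / entropy

# Toy graphs (hypothetical, not from the paper).
labels = {0: "A", 1: "A", 2: "B", 3: "B"}
homophilous = [(0, 1), (2, 3)]             # every edge joins same-label nodes
heterophilous = [(0, 2), (1, 3)]           # every edge joins different-label nodes
mixed = [(0, 1), (2, 3), (0, 2), (1, 3)]   # neighbor label says nothing about own label

li_homophilous = label_informativeness(homophilous, labels)
li_heterophilous = label_informativeness(heterophilous, labels)
li_mixed = label_informativeness(mixed, labels)
```

Under this definition both the perfectly homophilous and the perfectly heterophilous toy graphs score 1, while the mixed graph scores 0, which is why the statistic is treated as distinct from edge homophily.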

What carries the argument

The performance gap between sum and mean aggregation (in GIN and related models) and its correlation with label informativeness, which holds or fails depending on the composition of the benchmark suite.
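The difference between the two aggregators can be illustrated with a minimal NumPy sketch of a single aggregation step on a dense adjacency matrix. It omits the self term and the MLP that GIN applies after aggregation, and the helper name `aggregate` and the toy graph are assumptions for illustration, not the paper's code.

```python
import numpy as np

def aggregate(adj, feats, mode):
    """One round of neighbor aggregation on a dense adjacency matrix."""
    summed = adj @ feats  # row i holds the sum of the features of i's neighbors
    if mode == "sum":
        return summed
    if mode == "mean":
        deg = adj.sum(axis=1, keepdims=True)
        return summed / np.maximum(deg, 1)  # guard against isolated nodes
    raise ValueError(f"unknown mode: {mode}")

# Toy graph (hypothetical): node 0 has three neighbors, node 4 has one.
adj = np.zeros((5, 5))
for u, v in [(0, 1), (0, 2), (0, 3), (3, 4)]:
    adj[u, v] = adj[v, u] = 1.0

feats = np.ones((5, 1))  # identical features on every node

sum_out = aggregate(adj, feats, "sum")
mean_out = aggregate(adj, feats, "mean")
```

With identical node features, the sum output still separates node 0 from node 4 by neighbor count, while the mean output is the same for both. Sum aggregation therefore carries degree information that mean aggregation discards.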

If this is right

Edge homophily is only weakly predictive of the sum versus mean gap across the full set of datasets.
The spectral gap distinguishes Facebook-100 graphs from other low-informativeness datasets, and the aggregator effect is localized to one-hop neighborhoods.
PNA can underperform the best single-aggregator GIN on standard citation benchmarks.
Training length interacts with aggregator choice, with extended training amplifying the sum advantage on Facebook-100 graphs.
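The spectral gap mentioned above can be computed in several ways. The sketch below takes one common convention, the second-smallest eigenvalue of the symmetric normalized Laplacian, and should be read as an assumption, since this page does not say which definition the paper adopts. The function name `spectral_gap` and the toy graphs are likewise illustrative.

```python
import numpy as np

def spectral_gap(adj):
    """Second-smallest eigenvalue of L = I - D^{-1/2} A D^{-1/2} (one common convention)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    lap = np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(lap)  # symmetric input, eigenvalues in ascending order
    return float(eigvals[1])

# Toy graphs (hypothetical): a complete graph on 4 nodes and two disjoint edges.
k4 = np.ones((4, 4)) - np.eye(4)
two_edges = np.zeros((4, 4))
two_edges[0, 1] = two_edges[1, 0] = 1.0
two_edges[2, 3] = two_edges[3, 2] = 1.0

gap_k4 = spectral_gap(k4)
gap_two_edges = spectral_gap(two_edges)
```

The densely connected graph has a large gap and the disconnected graph has a gap of zero. For graphs of Facebook-100 size, a sparse eigensolver would be a more practical choice than the dense decomposition used here.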

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adaptive aggregation methods will need to target the specific properties of dense friendship networks rather than relying on label informativeness alone.
Other real-world social graphs with similar spectral properties may exhibit the same sum preference even when label informativeness is low.
GNN evaluation suites should deliberately include multiple benchmark regimes to expose when design rules fail to generalize.
The interaction between training schedule and aggregator performance suggests that longer training could be used as a simple way to exploit sum aggregation in certain low-informativeness settings.

Load-bearing premise

Stochastic block models with degree correction adequately isolate basic statistics such as mean degree from whatever produces the Facebook-100 sum preference.

What would settle it

An experiment that tunes a degree-corrected stochastic block model to also match the spectral gap of Facebook-100 graphs and then checks whether the sum preference appears on the generated graphs.
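As a starting point for such an experiment, the following is a minimal sketch of a Bernoulli degree-corrected stochastic block model sampler. The Bernoulli form, the function name `sample_dcsbm`, and every parameter value are assumptions made for illustration; the paper's own generator and settings are not described on this page.

```python
import numpy as np

def sample_dcsbm(blocks, theta, block_rates, rng):
    """Sample an undirected simple graph from a Bernoulli degree-corrected SBM."""
    blocks = np.asarray(blocks)
    theta = np.asarray(theta, dtype=float)
    block_rates = np.asarray(block_rates, dtype=float)
    # Pair (i, j) gets probability theta_i * theta_j * rate[block_i, block_j], clipped to [0, 1].
    probs = np.outer(theta, theta) * block_rates[np.ix_(blocks, blocks)]
    probs = np.clip(probs, 0.0, 1.0)
    # Sample each unordered pair once from the strict upper triangle (no self-loops).
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(float)

# Hypothetical settings, not taken from the paper.
rng = np.random.default_rng(0)
n = 200
blocks = np.repeat([0, 1], n // 2)
theta = rng.pareto(2.5, size=n) + 1.0  # heavy-tailed degree propensities
theta /= theta.mean()                  # normalize so the average propensity is 1
block_rates = np.array([[0.10, 0.08],
                        [0.08, 0.10]])
adj = sample_dcsbm(blocks, theta, block_rates, rng)
```

Tuning `theta` and `block_rates` until the generated graphs match both the degree scale and the spectral gap of a target Facebook-100 graph, and then training the sum and mean variants on them, would be the remaining steps of the proposed test.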

Figures

Figures reproduced from arXiv: 2606.10249 by Neha Sharma, Ritesh Sharma.

**Figure 2.** PNA (Corso et al., 2020) accuracy minus the best single-aggregator GIN variant, plotted against …

**Figure 3.** Per-dataset GIN-Sum minus GIN-Mean gap (x-axis) versus GraphSAGE-Sum minus GraphSAGE …

**Figure 4.** GIN-Sum minus GIN-Mean accuracy gap on stochastic block model graphs across a …

Original abstract

We examine whether graph neural network (GNN) design rules generalize across benchmark families by studying aggregator selection (sum, mean, max) on 24 node-classification datasets spanning citation, heterophilic, LINKX Facebook-100, co-purchase, and co-authorship graphs. Edge homophily is only weakly predictive of the GIN-Sum versus GIN-Mean performance gap. Label informativeness predicts this gap well on legacy benchmarks but degrades substantially when Facebook-100 graphs are included. In these dense friendship networks, near-zero label informativeness coexists with a strong preference for sum aggregation, producing gains of 7-10% and up to 13% under extended training. Stochastic block model ablations, including degree-corrected variants matching Facebook-100 degree scales, fail to reproduce this behavior, indicating that mean degree alone does not explain the effect. Among several label-independent graph statistics, the spectral gap uniquely distinguishes these graphs from other low-informativeness datasets, with the effect localized to one-hop neighborhoods and replicated across architectures. We further identify training regimes that interact with aggregator choice and show that PNA can underperform the best single-aggregator GIN on standard citation benchmarks. Our results suggest that benchmark composition, rather than numerical insufficiency, determines whether design rules appear to generalize, and that the Facebook-100 regime provides a concrete target for future adaptive aggregation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Facebook-100 graphs break the label informativeness rule for GNN aggregator choice, and degree-corrected SBMs do not explain the sum preference.

The paper shows that label informativeness stops predicting whether sum or mean aggregation wins once Facebook-100 graphs are added to the test suite. On those dense networks the preference for sum persists even with near-zero informativeness, and the gap reaches 7-10 percent.

The work covers 24 datasets across several families and runs the same aggregator comparison on all of them. It then checks a short list of label-independent statistics and reports that spectral gap is the only one that cleanly separates the Facebook-100 behavior, with the effect showing up in one-hop neighborhoods. The training-regime interaction and the PNA comparison are also straightforward observations.

The SBM ablations are the soft spot. Matching degree scales does not guarantee that the models reproduce the spectral-gap distribution or other local structure of the real graphs, so the claim that mean degree is ruled out rests on an incomplete control. The abstract gives no error bars or statistical tests, which leaves the size of the reported gaps harder to judge.

This is useful for anyone who designs or validates new aggregation functions and wants to know where the usual rules stop generalizing. It deserves a serious referee because the empirical pattern is clear enough to warrant closer scrutiny of the controls and the statistics, even if the causal isolation needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript examines whether GNN aggregator selection rules (sum vs. mean vs. max) generalize across benchmark families by evaluating node classification performance on 24 datasets spanning citation networks, heterophilic graphs, LINKX Facebook-100, co-purchase, and co-authorship graphs. It reports that edge homophily is only weakly predictive of the GIN-Sum vs. GIN-Mean gap, while label informativeness predicts the gap well on legacy benchmarks but loses predictive power when Facebook-100 graphs are added. In these dense friendship networks, near-zero label informativeness coexists with a strong sum preference yielding 7-10% gains (up to 13% under extended training). Degree-corrected SBM ablations matching Facebook-100 degree scales fail to reproduce the sum preference. Among label-independent statistics, spectral gap uniquely distinguishes the Facebook-100 regime, with the effect localized to one-hop neighborhoods and replicated across architectures. The paper concludes that benchmark composition, rather than numerical insufficiency, determines whether design rules appear to generalize.

Significance. If the results hold, the work provides concrete evidence that benchmark composition can determine the apparent validity of GNN design heuristics, identifying the Facebook-100 regime as a distinct target for adaptive aggregation methods. Strengths include the scale of the empirical evaluation (24 datasets), the use of SBM ablations to test structural explanations, comparisons of multiple label-independent statistics, and replication across architectures and training regimes. This supplies a falsifiable, regime-specific target rather than relying solely on existing benchmarks.

major comments (2)

[SBM ablations] The central claim that mean degree alone does not explain the Facebook-100 sum preference rests on the reported failure of degree-corrected SBMs (matching FB-100 degree scales) to reproduce the observed behavior. However, the paper identifies spectral gap as the unique distinguisher among label-independent statistics; without reporting whether the SBM variants reproduce the spectral-gap distribution (or other one-hop neighborhood statistics) of the real Facebook-100 graphs, the isolation of mean degree remains incomplete and the residual preference could still be attributable to unmatched structural properties.
[Results on performance gaps, Facebook-100 rows] The reported gains of 7-10% (up to 13% under extended training) for sum aggregation are presented as evidence of a distinct regime, but the manuscript provides no error bars, run-to-run standard deviations, or statistical significance tests. Given that the load-bearing claim concerns both the existence and magnitude of these gaps when label informativeness is near zero, the absence of these details leaves the reliability of the effect sizes and cross-dataset comparisons open to question.

minor comments (2)

[Abstract and methods] The abstract states that the effect is 'localized to one-hop neighborhoods' but the main text should explicitly define how this localization was measured (e.g., via modified neighborhood statistics or ablation on k-hop subgraphs) to allow replication.
[Tables/figures] Table or figure captions for the 24-dataset results should include the exact number of runs per entry and whether the same random seeds were used across aggregators.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our empirical claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [SBM ablations] The central claim that mean degree alone does not explain the Facebook-100 sum preference rests on the reported failure of degree-corrected SBMs (matching FB-100 degree scales) to reproduce the observed behavior. However, the paper identifies spectral gap as the unique distinguisher among label-independent statistics; without reporting whether the SBM variants reproduce the spectral-gap distribution (or other one-hop neighborhood statistics) of the real Facebook-100 graphs, the isolation of mean degree remains incomplete and the residual preference could still be attributable to unmatched structural properties.

Authors: We agree that verifying whether the degree-corrected SBMs match the spectral gap (and other one-hop statistics) of the Facebook-100 graphs is necessary to fully isolate mean degree as an explanation. In the revised manuscript, we will add a direct comparison of the spectral gap distributions between the real Facebook-100 graphs and the generated SBM variants, along with any other relevant neighborhood statistics. This will clarify whether the residual sum preference can be attributed to unmatched structural properties.
Revision: yes

Referee: [Results on performance gaps, Facebook-100 rows] The reported gains of 7-10% (up to 13% under extended training) for sum aggregation are presented as evidence of a distinct regime, but the manuscript provides no error bars, run-to-run standard deviations, or statistical significance tests. Given that the load-bearing claim concerns both the existence and magnitude of these gaps when label informativeness is near zero, the absence of these details leaves the reliability of the effect sizes and cross-dataset comparisons open to question.

Authors: We acknowledge that reporting error bars, run-to-run standard deviations, and statistical significance tests is essential for substantiating the performance gaps on Facebook-100. In the revised version, we will include these details: mean accuracies with standard deviations over multiple random seeds, and paired statistical tests (e.g., t-tests) for the key sum vs. mean comparisons. This will support the reported gains of 7-10% (up to 13% under extended training) with appropriate measures of reliability.
Revision: yes
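The paired comparison promised in the second response could look roughly like the sketch below, which computes a paired t statistic over per-seed accuracies using only the standard library. The function name `paired_t_statistic` is an assumption, and the accuracy lists are placeholder values invented for illustration, not results from the paper.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b, e.g. same seeds."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the per-seed differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder per-seed accuracies (hypothetical, NOT from the paper).
sum_acc = [0.80, 0.82, 0.81, 0.79, 0.83]
mean_acc = [0.72, 0.73, 0.74, 0.71, 0.73]

t_stat, dof = paired_t_statistic(sum_acc, mean_acc)
gap_mean = mean(x - y for x, y in zip(sum_acc, mean_acc))
gap_sd = stdev([x - y for x, y in zip(sum_acc, mean_acc)])
```

Pairing by seed matters here because the same split and initialization are shared across aggregators, so the per-seed difference removes variance that an unpaired test would count against the effect. A p-value would follow from the t distribution with `dof` degrees of freedom.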

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison with no derivations or fitted reductions

The paper conducts an empirical study comparing GNN aggregator performance (sum/mean/max) across 24 public datasets and SBM variants. All central claims rest on observed accuracy differences, label informativeness correlations, and failure of degree-matched SBMs to reproduce Facebook-100 behavior. No equations, parameter fits, or derivations are present that could reduce a prediction to its input by construction. No self-citations are invoked as load-bearing uniqueness theorems. The work is self-contained against external benchmarks and does not rename known results or smuggle ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study with no new mathematical axioms, free parameters, or invented entities; it relies on standard GNN training assumptions and publicly available datasets.

axioms (1)

Domain assumption: standard GNN training assumptions, such as consistent hyperparameter choices across datasets.
The comparison of aggregator performance implicitly assumes that training protocols do not favor one aggregator on particular graph families.

pith-pipeline@v0.9.1-grok · 5783 in / 1381 out tokens · 32469 ms · 2026-06-27T16:52:02.554226+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 1 canonical work pages

[1] Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR).
[2] Graph Attention Networks. International Conference on Learning Representations (ICLR).
[3] Inductive Representation Learning on Large Graphs. Advances in Neural Information Processing Systems (NeurIPS).
[4] How Powerful are Graph Neural Networks? International Conference on Learning Representations (ICLR).
[5] Principal Neighbourhood Aggregation for Graph Nets. Advances in Neural Information Processing Systems (NeurIPS).
[6] When Do Graph Neural Networks Help with Node Classification? Investigating the Impact of Homophily Principle on Node Distinguishability. Advances in Neural Information Processing Systems (NeurIPS).
[7] Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs. Advances in Neural Information Processing Systems (NeurIPS).
[8] Is Homophily a Necessity for Graph Neural Networks? International Conference on Learning Representations (ICLR).
[9] A Critical Look at the Evaluation of GNNs Under Heterophily: Are We Really Making Progress? International Conference on Learning Representations (ICLR).
[10] Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods. Advances in Neural Information Processing Systems (NeurIPS).
[11] Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification. Advances in Neural Information Processing Systems (NeurIPS).
[12] Revisiting Semi-Supervised Learning with Graph Embeddings. International Conference on Machine Learning (ICML).
[13] Geom-GCN: Geometric Graph Convolutional Networks. International Conference on Learning Representations (ICLR).
[14] Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. KDD.
[15] Pitfalls of Graph Neural Network Evaluation. Relational Representation Learning Workshop, NeurIPS.
[16] Social Influence Analysis in Large-scale Networks. KDD.
[17] Sunil Kumar Maurya, Xin Liu, and Tsuyoshi Murata. Simplifying approach to node classification in Graph Neural Networks. 2022. doi:10.1016/j.jocs.2022.101695
[18] Beyond fixed depth: Adaptive graph neural networks for node classification under varying homophily. Proceedings of the AAAI Conference on Artificial Intelligence.
[19] Ghogho, Mounir. Revisiting Neighborhood Aggregation in Graph Neural Networks for Node Classification using Statistical Signal Processing.
[20] Chien, Eli and Peng, Jianhao and Li, Pan and Milenkovic, Olgica. Adaptive Universal Generalized
[21] Abu-El-Haija, Sami and Perozzi, Bryan and Kapoor, Amol and Alipourfard, Nazanin and Lerman, Kristina and Harutyunyan, Hrayr and Ver Steeg, Greg and Galstyan, Aram.
[22] Revisiting Heterophily For Graph Neural Networks. Advances in Neural Information Processing Systems (NeurIPS).
[23] Open Graph Benchmark: Datasets for Machine Learning on Graphs. Advances in Neural Information Processing Systems (NeurIPS).
