
arxiv: 2601.01123 · v2 · pith:JPWQ5VKHnew · submitted 2026-01-03 · 💻 cs.LG · cs.AI

Learning from Historical Activations in Graph Neural Networks

Pith reviewed 2026-05-21 17:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords graph neural networks · graph pooling · attention mechanisms · historical activations · deep GNNs · over-smoothing · graph classification

The pith

Graph neural networks improve classification accuracy by attending to node activations from every previous layer instead of only the final one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HISTOGRAPH, a final aggregation layer for GNNs that uses a two-stage attention process to incorporate intermediate activations produced during the forward pass. Standard pooling methods rely only on the last layer's output, which can lose task-relevant details when node representations shift across layers or when deep networks over-smooth. HISTOGRAPH first applies a unified layer-wise attention that weights the historical activations, then applies node-wise attention to refine the graph-level descriptor using the graph structure. A sympathetic reader would care because the method offers a way to make both shallow and deep GNNs more effective on graph classification without redesigning the underlying message-passing layers.

Core claim

HISTOGRAPH is a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, our HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction.

What carries the argument

HISTOGRAPH, the two-stage attention mechanism consisting of unified layer-wise attention over all intermediate activations followed by node-wise attention, which extracts and combines historical node features while respecting graph structure.
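
The two stages can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes scaled dot-product attention without learned projections, reads the Figure 1 caption as "the final layer is the query and scores are averaged over nodes, so all nodes share one weight per layer", and ends with a mean over nodes. The function names (`layer_wise_attention`, `node_wise_attention`, `two_stage_readout`) and the toy features are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def layer_wise_attention(history):
    """history[l][i] is the feature vector of node i after layer l.

    Assumption: the last entry is the query and scores are averaged over
    nodes, giving one attention weight per layer shared by all nodes.
    """
    final = history[-1]
    n, d = len(final), len(final[0])
    scores = [
        sum(dot(final[i], layer[i]) for i in range(n)) / (n * math.sqrt(d))
        for layer in history
    ]
    weights = softmax(scores)
    # Per-node aggregated embeddings H: weighted sum over layers.
    H = [
        [sum(w * layer[i][k] for w, layer in zip(weights, history)) for k in range(d)]
        for i in range(n)
    ]
    return H, weights

def node_wise_attention(H):
    """Plain self-attention across nodes (no learned projections here)."""
    d = len(H[0])
    out = []
    for q in H:
        a = softmax([dot(q, k) / math.sqrt(d) for k in H])
        out.append([sum(w * v[k] for w, v in zip(a, H)) for k in range(d)])
    return out

def two_stage_readout(history):
    H, weights = layer_wise_attention(history)
    Z = node_wise_attention(H)
    n, d = len(Z), len(Z[0])
    # Assumed final step: mean over nodes to get one graph-level vector.
    graph_vec = [sum(z[k] for z in Z) / n for k in range(d)]
    return graph_vec, weights

# Toy history: 3 layers, 3 nodes, 2 features (made-up values).
history = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.8, 0.2], [0.3, 0.7], [0.6, 0.6]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
]
graph_vec, weights = two_stage_readout(history)
```

In this toy input the final layer is constant across nodes, yet the readout still draws on the earlier, more varied layers through `weights`.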

If this is right

  • Graph classification accuracy rises on multiple standard benchmarks compared with pooling methods that use only the final layer.
  • Gains are largest and most consistent when the underlying GNN is deep, where over-smoothing normally hurts performance.
  • The final graph descriptor becomes more informative because it incorporates node representation trajectories rather than a single snapshot.
  • Traditional GNN pipelines can adopt the new aggregation layer as a drop-in replacement with no change to message-passing or training procedure.
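
A toy case shows why a single final-layer snapshot can be less informative than the trajectory. The sketch below assumes two hypothetical graphs whose final-layer features have collapsed to the same constant (an extreme form of over-smoothing) and uses fixed layer weights as a stand-in for learned attention. All names and values are illustrative and not taken from the paper.

```python
def mean_pool(X):
    """Average node feature vectors into one vector."""
    n, d = len(X), len(X[0])
    return [sum(x[k] for x in X) / n for k in range(d)]

def last_layer_readout(history):
    """Standard readout: pool only the final layer."""
    return mean_pool(history[-1])

def history_readout(history, layer_weights):
    """Pool every layer, then mix the pooled vectors with the given weights."""
    pooled = [mean_pool(X) for X in history]
    d = len(pooled[0])
    return [sum(w * p[k] for w, p in zip(layer_weights, pooled)) for k in range(d)]

# Two graphs with different early-layer features but identical final layers.
graph_a = [[[1.0, 0.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]]
graph_b = [[[0.0, 1.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
weights = [0.5, 0.5]  # fixed stand-in for learned layer-wise attention

same_last = last_layer_readout(graph_a) == last_layer_readout(graph_b)
same_hist = history_readout(graph_a, weights) == history_readout(graph_b, weights)
```

Here `same_last` is true while `same_hist` is false: a classifier fed only the final layer cannot separate the two graphs, whereas a history-aware descriptor can.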

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If historical activations prove useful here, similar layer-wise attention could be tested in sequential models outside graphs, such as recurrent networks where hidden states evolve over time steps.
  • The approach might lessen the practical cost of choosing network depth, because earlier layers remain accessible even if later layers become less informative.
  • Extending the same two-stage attention to node-level or edge-level tasks on graphs would test whether the benefit is specific to global pooling or more general.

Load-bearing premise

Intermediate-layer activations contain additional task-relevant information not already present in the final-layer features, and a two-stage attention mechanism can extract and combine this information without introducing harmful noise or overfitting.

What would settle it

An ablation experiment that disables the layer-wise attention component and measures no drop in accuracy on the same set of graph classification benchmarks would show that historical activations do not supply the claimed additional value.
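
One way to make "no drop in accuracy" operational is to compare per-seed accuracies of the full and ablated models. The sketch below assumes paired runs on the same seeds and a rule of thumb of two standard errors; both the threshold and the function names are assumptions, and the numbers in the usage lines are placeholders, not results from the paper.

```python
import math
import statistics

def paired_ablation_summary(full_acc, ablated_acc):
    """Mean accuracy drop (full minus ablated) and its standard error."""
    if len(full_acc) != len(ablated_acc) or len(full_acc) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [f - a for f, a in zip(full_acc, ablated_acc)]
    mean_drop = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_drop, std_err

def drop_is_clear(mean_drop, std_err, k=2.0):
    """Assumed rule of thumb: the drop counts if it exceeds k standard errors."""
    return mean_drop > k * std_err

# Placeholder per-seed accuracies, for illustration only.
no_effect = paired_ablation_summary([0.80, 0.82, 0.81], [0.80, 0.83, 0.80])
clear_effect = paired_ablation_summary([0.90, 0.91, 0.92], [0.80, 0.80, 0.80])
```

With the first placeholder set, `drop_is_clear(*no_effect)` is false, which is the outcome that would count against the historical-activation premise.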

Figures

Figures reproduced from arXiv: 2601.01123 by Hadar Sinai, Haggai Maron, Moshe Eliasof, Yaniv Galron.

Figure 1
Figure 1: Overview of HISTOGRAPH. (1) Given input node features X_0 and adjacency A, a backbone GNN produces historical graph activations X_1, ..., X_{L-1}. (2) The Layer-wise attention module uses the final-layer embedding as a query to attend over all historical states while averaging across nodes, yielding per-node aggregated embeddings H. (3) A Node-wise self-attention module refines H by modeling interactions across …
Figure 2
Figure 2: Visualizations on the IMDB-B dataset with 64-layer HISTOGRAPH. (left) Attention patterns across layers under different training regimes. (right) Embedding evolution throughout training, measured by the normed difference between final and intermediate representations.
[Figure panels: (a) Input, (b) Target, (c) GCN, (d) HI…]
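
The quantity in the right panel of Figure 2 can be sketched as follows. The caption says only "normed difference", so the Frobenius norm is an assumption, and the function names and toy values are hypothetical.

```python
import math

def frobenius_gap(X_final, X_layer):
    """Frobenius norm of the difference between two node-feature matrices."""
    return math.sqrt(
        sum(
            (a - b) ** 2
            for row_f, row_l in zip(X_final, X_layer)
            for a, b in zip(row_f, row_l)
        )
    )

def embedding_gaps(history):
    """Gap between the final layer and every intermediate layer."""
    final = history[-1]
    return [frobenius_gap(final, X) for X in history[:-1]]

# Toy history with one node and two features per layer (made-up values).
history = [[[3.0, 0.0]], [[0.0, 4.0]], [[0.0, 0.0]]]
gaps = embedding_gaps(history)
```

Small gaps would indicate that intermediate layers are close to the final one, while large gaps suggest the history carries different information.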
Figure 3
Figure 3: Graph and signal transformations: (a) input node features; (b) prediction target, the node …
Figure 4
Figure 4: Average training time per epoch (in log scale) for GCN backbones with 3 and 32 layers, …
Figure 5
Figure 5: Barbell graph illustrating a distribution shift: a singleton node (right) is connected to a …
Original abstract

Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains such as social networks, molecular chemistry, and more. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node's representation can shift significantly over the course of many graph neural layers, and worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HISTOGRAPH, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, our HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HISTOGRAPH offers strong performance that consistently improves traditional techniques, with particularly strong robustness in deep GNNs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces HISTOGRAPH, a two-stage attention-based final aggregation layer for GNNs. It first performs layer-wise attention over intermediate (historical) node activations from all layers, then applies node-wise attention, with the goal of leveraging representation evolution across layers to improve graph classification performance and robustness in deep GNNs prone to over-smoothing.

Significance. If the empirical claims hold under proper controls, the approach could meaningfully address under-utilization of intermediate activations in standard GNN pooling, offering a practical way to mitigate over-smoothing without altering the base GNN architecture.

major comments (2)
  1. Abstract: the central empirical claim ('strong performance that consistently improves traditional techniques, with particularly strong robustness in deep GNNs') is presented without any mention of datasets, baselines, number of runs, statistical tests, or ablation results, preventing assessment of whether the reported gains support the historical-activation premise.
  2. Method (two-stage attention description): the mechanism adds learnable parameters for both layer-wise and node-wise attention stages on top of the base GNN. No experiment is described that holds total parameter count fixed while removing the historical component (e.g., identical node-wise attention applied only to final-layer features). Without this control, it remains unclear whether gains derive from historical activations or simply from increased attention capacity.
minor comments (1)
  1. Abstract and introduction: the term 'historical graph activations' is used without a precise definition or notation distinguishing it from residual connections or layer-wise concatenation already common in GNN literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our work on HISTOGRAPH. We address each major comment in turn below, indicating the revisions we will make.

Point-by-point responses
  1. Referee: Abstract: the central empirical claim ('strong performance that consistently improves traditional techniques, with particularly strong robustness in deep GNNs') is presented without any mention of datasets, baselines, number of runs, statistical tests, or ablation results, preventing assessment of whether the reported gains support the historical-activation premise.

    Authors: We agree that the abstract would be strengthened by including concrete details on the experimental setup. In the revised manuscript, we will expand the abstract to reference the specific graph classification benchmarks (e.g., MUTAG, PROTEINS, NCI1), the standard pooling baselines, the averaging of results over multiple runs with reported standard deviations, and the key ablation studies isolating the historical component.
    revision: yes

  2. Referee: Method (two-stage attention description): the mechanism adds learnable parameters for both layer-wise and node-wise attention stages on top of the base GNN. No experiment is described that holds total parameter count fixed while removing the historical component (e.g., identical node-wise attention applied only to final-layer features). Without this control, it remains unclear whether gains derive from historical activations or simply from increased attention capacity.

    Authors: This is a fair observation on the need for tighter controls. The layer-wise attention is specifically formulated to operate over the full history of node activations across layers, which is distinct from final-layer-only attention. To isolate this effect, we will add a new ablation in the revision: a parameter-matched variant that applies equivalent node-wise attention only to the final-layer features (by adjusting hidden dimensions or adding dummy parameters to equalize total count). Results from this control will be reported to test whether gains arise from the historical component rather than capacity alone.
    revision: yes
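
The parameter-matching step proposed above can be sketched with simple arithmetic. The count below assumes each attention stage is a single-head block with four d x d projections (query, key, value, output) plus biases. That layout is an assumption, not something the paper states, and the helper names are hypothetical.

```python
def attention_params(d):
    """Assumed count: Q, K, V and output projections, each d x d with a bias."""
    return 4 * (d * d + d)

def two_stage_params(d):
    """Layer-wise plus node-wise stage, both of width d (assumed symmetric)."""
    return 2 * attention_params(d)

def matched_width(target, max_d=4096):
    """Smallest width whose single attention stage has at least `target` parameters."""
    for d in range(1, max_d + 1):
        if attention_params(d) >= target:
            return d
    raise ValueError("target too large for max_d")

full = two_stage_params(64)          # two stages at hidden width 64
control_width = matched_width(full)  # width for a node-wise-only control
```

Under these assumptions a node-wise-only control would need a hidden width of about 91 to match two stages at width 64, which shows how widening one stage can equalize capacity.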

Circularity Check

0 steps flagged

No circularity: empirical architecture with independent components

full rationale

The paper describes HISTOGRAPH as a two-stage attention mechanism that aggregates historical layer activations in GNNs. No equations, derivations, or parameter-fitting steps are present in the provided text that reduce any claimed result to its own inputs by construction. The method is introduced as an architectural addition whose performance is evaluated empirically on benchmarks; the central claim does not rely on self-citation chains, uniqueness theorems from prior author work, or renaming of known results. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that historical activations carry useful extra signal and on the modeling choice of a two-stage attention architecture; no free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • domain assumption: Intermediate activations from earlier GNN layers contain task-relevant information not captured by the final layer alone.
    This premise motivates the entire HISTOGRAPH design and is invoked to explain why standard last-layer pooling is insufficient.
invented entities (1)
  • HISTOGRAPH two-stage attention layer (no independent evidence)
    purpose: To aggregate historical node activations using layer-wise then node-wise attention.
    New architectural component introduced by the paper.

pith-pipeline@v0.9.0 · 5755 in / 1229 out tokens · 54520 ms · 2026-05-21T17:09:00.403031+00:00 · methodology

