
arxiv: 2605.16112 · v1 · pith:OJXTKG5Bnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Attention Dispersion in Dynamic Graph Transformers: Diagnosis and a Transferable Fix

Pith reviewed 2026-05-20 19:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continuous-time dynamic graphs · transformers · attention dispersion · temporal distribution shift · differential attention · graph neural networks

The pith

Differential attention corrects attention dispersion in dynamic graph Transformers by focusing on critical nodes under temporal shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Transformers for continuous-time dynamic graphs fail to focus on critical nodes under temporal distribution shifts because these shifts weaken attention contrast and produce overly dispersed distributions. Through controlled ablations comparing distinguished historical neighbors to random ones, it shows that prediction depends on critical nodes carrying more predictive signal. Replacing standard attention with differential attention suppresses common-mode patterns and amplifies distinctive token signals, yielding consistent gains especially on high-shift datasets. Attention measurements confirm lower entropy and higher mass on critical nodes. The resulting DiffDyG model reaches state-of-the-art performance across nine benchmarks.

Core claim

Existing dynamic graph Transformers suffer from attention dispersion under temporal distribution shift: reduced contrast causes attention to spread evenly instead of concentrating on critical historical neighbors, which carry stronger predictive signals than arbitrary nodes. Differential attention addresses this by subtracting common-mode components to highlight token-specific differences. This lowers attention entropy, directs more attention mass to the critical nodes, and improves accuracy in a transferable way.

What carries the argument

differential attention, which suppresses common-mode attention and amplifies distinctive token-level signals
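
To make the mechanism concrete, here is a minimal single-query sketch of differential attention in plain Python. It assumes the general form used in the Differential Transformer literature, where two softmax attention maps are computed from two query/key projections and the second is subtracted after scaling by a factor `lam`. The function names, the fixed scalar `lam`, and the omission of learned projections and per-head normalization are simplifications made for illustration; the page does not state the exact parameterization used in DiffDyG.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = 1.0 / math.sqrt(len(query))
    return softmax([dot(query, k) * scale for k in keys])

def differential_attention(q1, q2, keys1, keys2, values, lam=0.5):
    """Single-query differential attention (illustrative sketch).

    Two softmax maps are computed and subtracted. `lam` is a fixed
    scalar here (an assumption; in practice it is usually learned).
    The resulting weights may be negative and sum to 1 - lam.
    """
    a1 = attention_weights(q1, keys1)
    a2 = attention_weights(q2, keys2)
    diff = [w1 - lam * w2 for w1, w2 in zip(a1, a2)]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(diff, values)) for d in range(dim)]
    return diff, out

# Toy example: four historical neighbors, the first is mildly preferred.
keys = [[2.0, 0.0], [0.5, 0.0], [0.5, 0.0], [0.5, 0.0]]
values = [[1.0], [0.0], [0.0], [0.0]]
base = attention_weights([1.0, 0.0], keys)
# A zero second query yields a uniform (common-mode) map to subtract.
diff, out = differential_attention([1.0, 0.0], [0.0, 0.0], keys, keys, values)
print("standard share on neighbor 0:    ", base[0])
print("differential share on neighbor 0:", diff[0] / sum(diff))
```

In this toy case the second map is uniform, so subtracting it removes an equal amount from every neighbor and the preferred neighbor holds a larger share of the remaining mass. That is the "suppress common-mode, amplify distinctive signal" effect described above, shown under idealized conditions.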

If this is right

  • Adding differential attention to existing CTDG Transformer baselines yields consistent performance gains.
  • The gains concentrate on datasets with high temporal distribution shift.
  • Attention entropy decreases and attention mass on critical nodes increases.
  • The combined DiffDyG model achieves state-of-the-art results on nine benchmarks under multiple negative sampling protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attention dispersion diagnosis may apply to other transformer models that process sequences with temporal or distributional changes.
  • Differential attention could be tested as a lightweight addition in non-graph sequence tasks facing similar focus problems.

Load-bearing premise

The controlled ablation correctly isolates that prediction depends on a distinct class of critical nodes with more predictive signal than random historical neighbors.

What would settle it

If applying differential attention on high-shift datasets produces no measurable drop in attention entropy and no rise in attention mass allocated to the identified critical nodes, the diagnosis and mechanism would be falsified.
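
The two diagnostics named in this test can be sketched as follows. This is a hedged illustration, not the paper's measurement code: it assumes attention weights are given as a non-negative, normalized list per query, and that the indices of critical nodes are known. Differential attention can produce negative weights, so some clipping and renormalization step (not specified on this page) would be needed before computing entropy.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of a normalized attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def critical_mass(weights, critical_idx):
    """Total attention mass assigned to the critical positions."""
    return sum(weights[i] for i in critical_idx)

# Toy distributions over four historical neighbors (illustrative values).
dispersed = [0.25, 0.25, 0.25, 0.25]
focused = [0.70, 0.10, 0.10, 0.10]
critical = {0}  # hypothetical: neighbor 0 is the critical node

for name, w in [("dispersed", dispersed), ("focused", focused)]:
    print(name, attention_entropy(w), critical_mass(w, critical))
```

Under the diagnosis, a successful fix should move a model from the "dispersed" pattern toward the "focused" one on high-shift datasets: lower entropy and higher critical mass.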

Figures

Figures reproduced from arXiv: 2605.16112 by Jinhao Zhang, Kangfei Zhao, Long-Kai Huang, Qiuhao Zeng.

Figure 1
Figure 1: Performance on 9 datasets with varying MMD. Legend reports Pearson’s R between AP and MMD.

The consistency across architectures is important. DyGFormer, TIDFormer, and TCL construct historical sequences in different ways, yet they degrade on the same high-shift datasets. In particular, the best existing Transformer reaches only 71.1% AP on US Legis., 66.5% on UN Trade, and 69.0% on UN Vote. This pattern su…
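
For readers unfamiliar with the two quantities in this caption, the sketch below shows one common way to compute them: a biased estimate of squared MMD with an RBF kernel, and Pearson's correlation coefficient. The kernel choice, the `gamma` bandwidth, and the toy samples are assumptions made for illustration. The page does not say which features or kernel the paper uses to measure shift, and none of the numbers below come from the paper.

```python
import math

def rbf(a, b, gamma):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def mmd2_biased(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy "train" and "test" feature samples (made up for illustration).
train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
test_near = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
test_far = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
print("MMD^2, small shift:", mmd2_biased(train, test_near))
print("MMD^2, large shift:", mmd2_biased(train, test_far))
```

A strongly negative `pearson_r` between per-dataset MMD and per-dataset AP would correspond to the pattern the caption describes, where accuracy falls as shift grows.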
Figure 2
Figure 3
Figure 3: Critical-node ablation. The x-axis shows the retention ratio of critical nodes: 100% keeps all critical nodes, while 0% masks all critical nodes. Non-critical nodes are kept unchanged.

Dataset       100%     90%     50%      0%
Wikipedia    98.81   95      91.25   85.2
Reddit       98.82   95.87   93.36   89.11
UCI          95.51   93.15   76.54   69.14
Enron        91.33   83.08   79.22   63.25
MOOC         86.48   82.12   66.57   58.46
Can. Parl.   97.36   88.24   81.23   72.1…
US Legis.    …
UN Trade     …
UN Vote      …
Figure 4
Figure 4: Random-node ablation. For each retention ratio, we mask the same number of randomly selected historical nodes as in the corresponding critical-node ablation. Thus, 0% masks a random set with the same cardinality as the full critical-node set, rather than masking all historical nodes. Therefore, this control asks whether the structurally and temporally defined nodes in K(u, v, t) are more informative than a…
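
The pairing of the two ablations in Figures 3 and 4 can be sketched as below. This is an illustrative reading of the captions, not the authors' code: `retention_ablation` and its arguments are hypothetical names, the historical sequence is a plain list of node ids, and the critical set is assumed to be given. The key property is that the random control masks exactly as many positions as the critical-node ablation does.

```python
import random

def retention_ablation(history, critical, retention, seed=0):
    """Return (critical_masked, random_masked) versions of `history`.

    history   : list of historical neighbor ids (may contain repeats)
    critical  : set of ids treated as critical nodes (assumed given)
    retention : fraction of critical positions to keep, in [0, 1]
    """
    rng = random.Random(seed)
    crit_pos = [i for i, n in enumerate(history) if n in critical]
    n_drop = len(crit_pos) - round(retention * len(crit_pos))

    # Critical-node ablation: drop only critical positions.
    drop_crit = set(rng.sample(crit_pos, n_drop))
    critical_masked = [n for i, n in enumerate(history) if i not in drop_crit]

    # Random control: drop the same number of positions, chosen uniformly
    # from the whole history (critical or not).
    drop_rand = set(rng.sample(range(len(history)), n_drop))
    random_masked = [n for i, n in enumerate(history) if i not in drop_rand]
    return critical_masked, random_masked

history = ["a", "b", "c", "d", "e", "f", "g", "h"]
critical = {"a", "b", "c", "d"}  # hypothetical critical set
for ratio in (1.0, 0.5, 0.0):
    crit_version, rand_version = retention_ablation(history, critical, ratio)
    print(ratio, crit_version, rand_version)
```

At 0% retention the critical-node version contains no critical nodes, while the random version still has the same length but a mix of both kinds. Comparing model accuracy on the two versions is what isolates the extra information carried by the critical set.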
Figure 5
Figure 5: Comparison of models’ AP, parameter size (MB), and training time (seconds per epoch).

Original abstract

Transformer-based architectures have become the dominant paradigm for Continuous-Time Dynamic Graph (CTDG) learning, yet their performance remains limited on temporally shifted datasets. In this work, we identify attention dispersion as a shared failure mode of dynamic graph Transformers under temporal distribution shift. Through controlled ablation contrasting structurally and temporally distinguished historical neighbors against random ones, we show that prediction depends on a class of critical nodes that carry consistently more predictive signal than arbitrary neighbors. However, existing Transformers fail to focus on these nodes even when they are present in the input, as temporal shift weakens attention contrast and produces overly dispersed attention distributions. This diagnosis suggests a simple and transferable fix: replace standard attention with differential attention, which suppresses common-mode attention and amplifies distinctive token-level signals. When added to three representative CTDG Transformer baselines, differential attention consistently improves performance, with gains concentrated on high-shift datasets. Attention-level measurements further confirm the mechanism, showing reduced attention entropy and increased attention mass on critical nodes. Building on these findings, we introduce DiffDyG, a reference implementation combining differential attention with standard input encodings. Across 9 benchmarks and three negative sampling protocols, DiffDyG achieves SOTA performance, with especially large gains on the most shifted datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper diagnoses attention dispersion as a failure mode in Transformer-based models for Continuous-Time Dynamic Graphs (CTDG) under temporal distribution shift. Through ablations, it claims that prediction relies on a class of critical nodes carrying more predictive signal than arbitrary neighbors, but existing models produce overly dispersed attention distributions that fail to focus on these nodes even when present. It proposes differential attention to suppress common-mode signals and amplify distinctive token-level information, reports consistent gains when added to three baselines (especially on high-shift data), provides attention measurements confirming reduced entropy and increased mass on critical nodes, and introduces DiffDyG which achieves SOTA across 9 benchmarks and three negative sampling protocols.

Significance. If the central diagnosis and mechanism hold after addressing controls, the work would offer a practical, transferable architectural adjustment for improving robustness of dynamic graph Transformers to temporal shifts, with potential value in domains such as link prediction on evolving networks. The combination of ablation-based diagnosis, attention-level diagnostics, and empirical gains on shifted datasets provides a coherent empirical narrative; the reference implementation DiffDyG further aids reproducibility.

major comments (1)
  1. Ablation study (described in the abstract and §4): the contrast between structurally/temporally distinguished historical neighbors and random ones does not report matching the random baseline on node degree, recency, or embedding similarity. Without such controls, observed performance gaps could be explained by these established confounders rather than the existence of a privileged critical-node class, weakening the empirical support for the claim that existing Transformers fail to focus on critical nodes even when present.
minor comments (2)
  1. Experimental results would benefit from reporting error bars, exact statistical tests, and a more complete experimental protocol to allow readers to assess the reliability of the reported consistent gains.
  2. Notation for differential attention could be clarified with an explicit equation or pseudocode block early in the methods section to make the mechanism easier to implement from the text alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our diagnosis of attention dispersion in CTDG Transformers. We address the single major comment below and will revise the manuscript to incorporate additional controls in the ablation study.

  1. Referee: Ablation study (described in the abstract and §4): the contrast between structurally/temporally distinguished historical neighbors and random ones does not report matching the random baseline on node degree, recency, or embedding similarity. Without such controls, observed performance gaps could be explained by these established confounders rather than the existence of a privileged critical-node class, weakening the empirical support for the claim that existing Transformers fail to focus on critical nodes even when present.

    Authors: We agree that matching the random baseline on node degree, recency, and embedding similarity would provide a stronger control and help rule out these potential confounders. In the revised manuscript we will add new ablation experiments in §4 that sample random neighbors to match the empirical distributions of these three properties within each critical-node set. This will isolate the contribution of structural and temporal distinction more cleanly. We expect the performance gap to persist under these matched conditions, but we will report the results transparently regardless of outcome.

    Revision: yes
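
One simple way to build such a matched control is greedy one-to-one nearest-neighbor matching on the confounding features. The sketch below is an assumption about how the proposed experiment could be set up, not a description of what the authors will run. The function name, the feature tuples (here degree and a recency score), and the squared Euclidean distance are all hypothetical choices. In practice the features would need to be standardized first so that no single one dominates the distance.

```python
def matched_controls(critical_feats, candidate_feats):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    critical_feats  : dict mapping critical node id -> feature tuple
    candidate_feats : dict mapping non-critical node id -> feature tuple
    Returns a dict mapping each critical id to its matched candidate id.
    """
    unused = dict(candidate_feats)
    matches = {}
    for cid, cf in critical_feats.items():
        if not unused:
            break  # fewer candidates than critical nodes
        best = min(
            unused,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(cf, unused[k])),
        )
        matches[cid] = best
        del unused[best]
    return matches

# Hypothetical (degree, recency) features for a few nodes.
critical = {"c1": (10.0, 0.9), "c2": (2.0, 0.1)}
candidates = {"n1": (9.0, 0.8), "n2": (2.0, 0.2), "n3": (50.0, 0.5)}
print(matched_controls(critical, candidates))
```

Masking the matched candidates instead of uniformly random nodes would test whether the performance gap survives once degree and recency are held roughly equal, which is the confounder concern raised in the major comment.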

Circularity Check

0 steps flagged

No significant circularity: empirical diagnosis and architectural modification remain independent of inputs

full rationale

The paper's chain consists of an empirical observation of attention dispersion under temporal shift, a controlled ablation showing predictive value of distinguished historical neighbors, and the introduction of differential attention as a fix. No equations or derivations reduce a claimed prediction to a fitted parameter or self-citation by construction. The ablation and benchmark results are presented as falsifiable measurements rather than tautological outputs. Self-citations, if present, are not load-bearing for the central diagnosis. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical validity of the ablation experiments and the assumption that observed attention dispersion is the primary driver of performance loss under temporal shift.

axioms (1)
  • domain assumption: Critical nodes carry consistently more predictive signal than arbitrary neighbors
    Established via the controlled ablation contrasting distinguished historical neighbors against random ones.
invented entities (1)
  • differential attention (no independent evidence)
    purpose: Suppress common-mode attention and amplify distinctive token-level signals
    Introduced as a drop-in replacement for standard attention to address dispersion.

pith-pipeline@v0.9.0 · 5750 in / 1270 out tokens · 46520 ms · 2026-05-20T19:59:02.213535+00:00 · methodology


