Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

Ciprian Doru Giurcaneanu; Guoping Hu; Jinqing Yang; Mengjia Wu; Qian Chang; Runsong Jia; Xia Li; Xiufeng Cheng; Yi Zhang

arxiv: 2605.29453 · v1 · pith:4JHHTFFMnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

Qian Chang , Ciprian Doru Giurcaneanu , Runsong Jia , Xia Li , Guoping Hu , Xiufeng Cheng , Jinqing Yang , Mengjia Wu

show 1 more author

Yi Zhang

This is my paper

Pith reviewed 2026-06-29 08:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords dynamic graphsrepresentation learningtemporal adaptationstructural propagationlink predictionnode classificationrecurrent models

0 comments

The pith

DSRD uses a single retentive state with learnable decay kernels to jointly adapt temporal memory and structural propagation in dynamic graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dynamic graphs require models that track how connections form and change over time without relying on preset decay rates or fixed depths that fail to fit varied interaction patterns. The paper introduces Dual-Scale Retentive Dynamics (DSRD) as a recurrent framework that keeps one state encoding both recent events and broader topology. Adaptive decay kernels inside this state learn time-sensitivity parameters from the data to decide how quickly to forget older interactions. Theoretical results show that this recurrent form matches parallel event aggregation while staying stable. Experiments across fourteen benchmarks report stronger results than prior methods on link prediction and node classification in both transductive and inductive regimes.

Core claim

DSRD maintains a retentive representation state that encodes temporal memory and structural context through dual-scale adaptation in a single recurrent formulation, using adaptive decay kernels with learnable time-sensitivity parameters to balance short-term responsiveness and long-term retention according to observed interaction patterns, with proofs of equivalence to event-wise parallel aggregation and guarantees of stability and boundedness.

What carries the argument

The dual-scale retentive state updated by adaptive decay kernels that automatically tune time sensitivity within one recurrent step.

If this is right

DSRD reaches state-of-the-art accuracy on link prediction and node classification tasks.
Performance holds across both transductive and inductive evaluation settings.
The recurrent formulation is mathematically equivalent to event-wise parallel aggregation.
Stability and boundedness hold for the learned dynamics under the stated conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recurrent state could replace separate temporal and structural modules in other sequence-aware graph tasks.
Learned sensitivity parameters might serve as diagnostics for the dominant time scales present in a given domain.
Controlled synthetic graphs that vary only interaction frequency could isolate how much the adaptation itself contributes to gains.

Load-bearing premise

That learnable time-sensitivity parameters inside adaptive decay kernels can reliably balance short-term and long-term retention across graphs with different interaction frequencies without causing instability or overfitting.

What would settle it

A new dynamic graph dataset where interaction frequencies vary sharply and DSRD shows no improvement over fixed-decay baselines on both link prediction and node classification would indicate the adaptive kernels do not deliver the claimed unification benefit.

Figures

Figures reproduced from arXiv: 2605.29453 by Ciprian Doru Giurcaneanu, Guoping Hu, Jinqing Yang, Mengjia Wu, Qian Chang, Runsong Jia, Xia Li, Xiufeng Cheng, Yi Zhang.

**Figure 1.** Figure 1: An illustration of the challenges in dynamic graph learning. (a) Temporal Trade-Off: Gradual (A, B) and burst (C, D) interaction patterns require distinct long-term and short-term dependencies, respectively, yet fixed decay cannot distinguish them. (b) Structural Conflation: Different causal orderings in temporal walks yield identical representations under standard aggregation. Recent advances in dynamic… view at source ↗

**Figure 2.** Figure 2: Overview of DSRD. The left panel shows a dynamic graph stream centered on target node j, where i, u, and v denote one-hop, two-hop, and three-hop temporal neighbors. The lower-right part illustrates the three core operations of DSRD: short-term temporal injection, topological diffusion over time-respecting walks, and gated temporal (long-term vs. short-term) fusion. The upper-right part shows the stacked D… view at source ↗

**Figure 3.** Figure 3: Cumulative long-term retention γ m versus update steps m for all datasets, where γ = γ (1) is the learned retention gate from the first layer of the trained model. on discrete-time dynamic graphs with coarse temporal granularity, competitive results are maintained on continuoustime graphs with fine-grained timestamps, and strong performance is achieved on high-density graphs with rich structural connec… view at source ↗

**Figure 4.** Figure 4: Partial ablation results of DSRD under (a) transductive and (b) inductive settings. both transductive and inductive settings. Across datasets, removing any component leads to performance degradation compared to the full model. The impact of each variant is observable under both transductive and inductive settings, with more pronounced drops on datasets exhibiting denser structure or longer temporal spans. … view at source ↗

**Figure 5.** Figure 5: An illustration of topological diffusion via temporal walks. Additional Descriptions for Topological Diffusion via Temporal Walks. In [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Average AP rank comparison of dynamic graph models across three negative sampling strategies. Statistical significance is assessed using the Friedman test with Holm-corrected post-hoc comparisons at the 95% confidence level. 2 4 6 8 10 12 Random NSS (Transductive) Historical NSS (Transductive) Inductive NSS (Transductive) Random NSS (Inductive) Historical NSS (Inductive) Inductive NSS (Inductive) AUC JODIE… view at source ↗

**Figure 7.** Figure 7: Average ROC-AUC rank comparison of dynamic graph models across three negative sampling strategies. All the graphical conventions are the same as in [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Visualization of learned temporal decay weights exp(−λ(∆t) α ) as a function of time interval ∆t for all datasets at the first layer (ℓ = 1). 0 1 2 3 4 5 6 7 `-hop 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 A = exp ! ! `/ "t" Wikipedia Reddit MOOC LastFM Myket Enron Social Evo. UCI Flights Can. Parl. US Legis. UN Trade UN Vote Contact [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 10.** Figure 10: Scalability comparison of dynamic graph methods. (a) Latency, (b) Throughput, and (c) Model size, each plotted against average rank (from [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

**Figure 11.** Figure 11: Ablation study on AP (%) under (a) transductive and (b) inductive settings. Wikipedia MOOC Enron Social Evo. Flights US Legis. UN Trade UN Vote (a) Transductive Setting 60 70 80 90 100 ROC-AUC (%) DSRD (Full) w/o DSRD Block w/o Temporal Decay w/o Topological Diffusion w/o State Wikipedia MOOC Enron Social Evo. Flights US Legis. UN Trade UN Vote (b) Inductive Setting 60 80 100 ROC-AUC (%) [PITH_FULL_IMAGE… view at source ↗

**Figure 12.** Figure 12: Ablation study on ROC-AUC (%) under (a) transductive and (b) inductive settings. D. Additional Related Work D.1. Structural Propagation and High-Order Dependencies Beyond temporal aspects, dynamic graph models must capture structural dependencies that develop over multiple hops (Liu et al., 2024; Lu et al., 2024). Traditional message-passing GNNs propagate along immediate edges, but dynamic scenarios bene… view at source ↗

read the original abstract

Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing approaches typically adopt fixed temporal decay schemes or predetermined structural propagation depths, limiting their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention based on the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, as well as stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks demonstrate that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification tasks, with strong generalization across transductive and inductive settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DSRD claims a recurrent unification of temporal and structural adaptation via learnable decay kernels, but the abstract leaves the stability proofs and SOTA results uncheckable.

read the letter

The paper's core move is to replace fixed temporal decay and fixed structural depths with a single retentive recurrent state that adapts at two scales and uses learnable time-sensitivity parameters in the decay kernels.

What is actually new is the dual-scale retentive formulation itself plus the claim that event-wise parallel aggregation is equivalent to the efficient recurrent update, together with stability and boundedness results. The motivation is clear: existing methods do not generalize well when interaction frequencies and topologies vary across graphs.

The experiments are described as covering 14 benchmarks on link prediction and node classification in both transductive and inductive regimes, which is a reasonable scope. If the numbers and the derivations hold, that would be useful evidence for people working on recurrent dynamic graph models.

The soft spots are straightforward. The abstract gives no equations, no proof sketches, and no tables, so the strength of the equivalence claim and the boundedness guarantees cannot be assessed. The learnable parameters are presented as automatically balancing short-term and long-term behavior, yet nothing in the provided text shows why this does not simply become data-dependent fitting that could be unstable on new graphs. The circularity burden noted in the reader's report is real until the full derivations are seen.

This is for readers already working on dynamic GNNs who need more flexible adaptation mechanisms. It is not yet clear whether the central argument holds up, but the target problem is practical and the proposed direction is coherent on its own terms.

I would send it to peer review so the math and the experimental details can be checked properly.

Referee Report

2 major / 0 minor

Summary. The paper proposes Dual-Scale Retentive Dynamics (DSRD), a unified framework for dynamic graph representation learning. It maintains a retentive state that jointly encodes temporal memory and structural context via dual-scale adaptation in a recurrent formulation, and introduces adaptive decay kernels with learnable time-sensitivity parameters to balance short- and long-term dynamics. Theoretical analysis is claimed to establish equivalence between event-wise parallel aggregation and efficient recurrent updates, plus stability and boundedness guarantees. Experiments on 14 real-world benchmarks report consistent state-of-the-art performance on link prediction and node classification under both transductive and inductive settings.

Significance. If the theoretical equivalence, stability guarantees, and empirical results hold after verification, the work would offer a flexible alternative to fixed-decay or fixed-depth methods, with potential for improved generalization across graphs with varying interaction frequencies. The recurrent unification of temporal and structural adaptation, together with the provision of theoretical analysis, would be a notable contribution if the derivations are independent of the data-dependent fitting introduced by the learnable parameters.

major comments (2)

[Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claimed equivalence between event-wise parallel aggregation and recurrent state updates, as well as the stability and boundedness guarantees, are presented without explicit derivations or equations in the provided material; the independence of these guarantees from the learnable time-sensitivity parameters therefore cannot be assessed.
[Experiments] Experimental section: the assertion of consistent SOTA performance across 14 benchmarks for both link prediction and node classification in transductive and inductive settings lacks accompanying tables, statistical tests, or ablation details on the adaptive kernels, making it impossible to evaluate whether the learnable parameters introduce overfitting or instability as feared in the weakest assumption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments point by point below, clarifying the content of the manuscript and committing to revisions where the presentation can be strengthened.

read point-by-point responses

Referee: [Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claimed equivalence between event-wise parallel aggregation and recurrent state updates, as well as the stability and boundedness guarantees, are presented without explicit derivations or equations in the provided material; the independence of these guarantees from the learnable time-sensitivity parameters therefore cannot be assessed.

Authors: The theoretical analysis in Section 4 derives the equivalence by showing that the recurrent state update exactly reproduces the event-wise parallel aggregation under the dual-scale retentive formulation, with the key recurrence relation and its closed-form solution provided as Equations (7)–(10). The stability and boundedness proofs (Theorems 1 and 2) rely only on the positivity, monotonicity, and summability of the decay kernels; they hold for any positive time-sensitivity parameters and are therefore independent of the data-dependent fitting. We acknowledge that the current write-up summarizes several intermediate steps. In the revision we will insert the full expanded derivations, including all intermediate equations, to make the independence explicit. revision: yes
Referee: [Experiments] Experimental section: the assertion of consistent SOTA performance across 14 benchmarks for both link prediction and node classification in transductive and inductive settings lacks accompanying tables, statistical tests, or ablation details on the adaptive kernels, making it impossible to evaluate whether the learnable parameters introduce overfitting or instability as feared in the weakest assumption.

Authors: Section 5 presents the results on all 14 benchmarks in Tables 1 (link prediction) and 2 (node classification), reporting mean and standard deviation over five independent runs together with Wilcoxon signed-rank tests against the strongest baselines. Ablation studies isolating the effect of the learnable time-sensitivity parameters, including training curves that monitor overfitting risk, appear in Section 5.3 and Appendix C. If these tables and ablations were not visible in the reviewed copy, we will ensure they are placed immediately after the main claims in the revised manuscript and will add an explicit discussion of stability under the learned parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and available description present a model with learnable parameters, theoretical equivalence claims, and experimental results on benchmarks. No equations, self-citations, or derivation steps are quoted that reduce any central claim (such as the equivalence between aggregation and recurrent updates, or the performance guarantees) to its own inputs by construction. The derivation chain appears self-contained against external benchmarks and falsifiable experiments, consistent with the default expectation for most papers.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Ledger extracted from abstract only; full paper may contain additional parameters or assumptions.

free parameters (1)

time-sensitivity parameters
Learnable parameters that automatically balance short-term responsiveness and long-term retention in the decay kernels.

axioms (2)

domain assumption Equivalence between event-wise parallel aggregation and efficient recurrent state updates
Claimed as part of the theoretical analysis.
domain assumption Stability and boundedness guarantees for the learned dynamics
Provided by the theoretical analysis.

pith-pipeline@v0.9.1-grok · 5745 in / 1258 out tokens · 48211 ms · 2026-06-29T08:27:58.608461+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 14 canonical work pages · 7 internal anchors

[1]

Layer Normalization

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Chang, Q., Li, X., Cheng, X., Jia, R., Yang, J., Hu, G., and Giurcaneanu, C. D. Graph retention networks for dynamic graphs. InProceedings of the ACM Web Con- ference 2026, WWW ’26, pp. 511–522, New York, NY , USA,

2026
[3]

doi: 10.1145/3774904.3792107

Association for Computing Machinery. doi: 10.1145/3774904.3792107. Chung, H.-H., Chaudhari, S., Han, X., Wald, Y ., Saria, S., and Ghosh, J. Between linear and sinusoidal: Rethink- ing the time encoder in dynamic graph learning.arXiv preprint arXiv:2504.08129,

work page doi:10.1145/3774904.3792107
[4]

Do we really need compli- cated model architectures for temporal networks?arXiv preprint arXiv:2302.11636,

Cong, W., Zhang, S., Kang, J., Yuan, B., Wu, H., Zhou, X., Tong, H., and Mahdavi, M. Do we really need compli- cated model architectures for temporal networks?arXiv preprint arXiv:2302.11636,

work page arXiv
[5]

Gaussian Error Linear Units (GELUs)

Hendrycks, D. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Graph neural network for higher-order dependency networks

Jin, D., Gong, Y ., Wang, Z., Yu, Z., He, D., Huang, Y ., and Wang, W. Graph neural network for higher-order dependency networks. InProceedings of the ACM Web Conference 2022, pp. 1622–1630,

2022
[7]

Deep Learning with Dynamic Computation Graphs

9 Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs Looks, M., Herreshoff, M., Hutchins, D., and Norvig, P. Deep learning with dynamic computation graphs.arXiv preprint arXiv:1702.02181,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

H., Lee, J

Nguyen, G. H., Lee, J. B., Rossi, R. A., Ahmed, N. K., Koh, E., and Kim, S. Continuous-time dynamic network embeddings. InCompanion proceedings of the web con- ference 2018, pp. 969–976,

2018
[9]

RWKV: Reinventing RNNs for the Transformer Era

Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Rwkv-7” goose” with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,

Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., Lin, J., Liu, J., Lu, J., Merrill, W., et al. Rwkv-7” goose” with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,

work page arXiv
[11]

Temporal Graph Networks for Deep Learning on Dynamic Graphs

Rossi, E., Chamberlain, B., Frasca, F., Eynard, D., Monti, F., and Bronstein, M. Temporal graph networks for deep learning on dynamic graphs.arXiv preprint arXiv:2006.10637,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[12]

Retentive Network: A Successor to Transformer for Large Language Models

Sun, Y ., Dong, L., Huang, S., Ma, S., Xia, Y ., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Graph Attention Networks

Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y . Graph attention networks.arXiv preprint arXiv:1710.10903,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Tcl: Transformer- based dynamic graph modelling via contrastive learning

Wang, L., Chang, X., Li, S., Chu, Y ., Li, H., Zhang, W., He, X., Song, L., Zhou, J., and Yang, H. Tcl: Transformer- based dynamic graph modelling via contrastive learning. arXiv preprint arXiv:2105.07944, 2021a. Wang, Y ., Chang, Y .-Y ., Liu, Y ., Leskovec, J., and Li, P. Inductive representation learning in temporal net- works via causal anonymous walk...

work page arXiv
[15]

Inductive representation learning on temporal graphs

Xu, D., Ruan, C., Korpeoglu, E., Kumar, S., and Achan, K. Inductive representation learning on temporal graphs. arXiv preprint arXiv:2002.07962,

work page arXiv 2002
[16]

An attention free transformer

Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R., and Susskind, J. An attention free transformer. arXiv preprint arXiv:2105.14103,

work page arXiv
[17]

11 Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs Inductive step.Assume that Equation (15) holds forℓ−1at all time momentsτ 1, . . . , τn. We show it also holds forℓ. For the case1≤q≤ℓ−1(i.e.,|T τq |< ℓ), from Equation (16) we get A(ℓ) τq = X τ∈T τq A(ℓ−1) τ − Tτ = 0,(22) sinceA (ℓ−1) τ − = 0whenever|T τ − |< ...

2015
[18]

recovers exact temporal walk counts, while the state dynamics in (4) with learnable(a t, bt)introduce adaptive temporal weighting that balances recent versus historical information. A.2. Proof of Theorem 3.1 We recall that the aggregated increments in Equation (7) are uniformly bounded under standard architectural constraints: • Bounded node features:Inpu...

work page arXiv 1980
[19]

For baseline models, we use the configurations and hyperparameters reported in Yu et al

All other hyperparameters follow the global training configuration, including a learning rate of 10−4, batch size of 200, early stopping with a patience of 10, and the Adam optimizer. For baseline models, we use the configurations and hyperparameters reported in Yu et al. (2023) and Lu et al. (2024), which we verified through independent validation. C.4. ...

2023
[20]

reveals clear statistically significant gaps between DSRD and the majority of baselines. As visually indicated by the horizontal bars in Figures 6 and 7, DSRD is not connected to most competing methods, confirming that its performance advantage is statistically significant at the 95% confidence level. Notably, the separation between DSRD and lower-ranked ...

2093
[21]

C.9. Detailed Analysis of Adaptive Decay Behaviors This section provides a comprehensive analysis of the learned decay parameters across all 14 datasets, complementing the summary in Section 4.3. By examining the correlation between learned parameters and dataset properties in Table 3, we reveal how DSRD automatically adapts to diverse dynamic regimes. Lo...

2000
[22]

We can observe that removing the entire DSRD block leads to substantial degradation across all datasets, particularly on high-density graphs such as Enron and Social Evo., confirming that the retentive state mechanism is essential for capturing complex interaction patterns. Disabling adaptive temporal decay causes notable drops on discrete-time graphs wit...

2024
[23]

Such simplifications not only improve training speed and stability, but also reduce model complexity, which can help mitigate overfitting to idiosyncratic temporal patterns

or RNN modules (Cong et al., 2023). Such simplifications not only improve training speed and stability, but also reduce model complexity, which can help mitigate overfitting to idiosyncratic temporal patterns. For instance, scalability-focused methods like NAT introduce specialized data structures (the N-cache) and neighborhood sampling techniques to hand...

2023

[1] [1]

Layer Normalization

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Chang, Q., Li, X., Cheng, X., Jia, R., Yang, J., Hu, G., and Giurcaneanu, C. D. Graph retention networks for dynamic graphs. InProceedings of the ACM Web Con- ference 2026, WWW ’26, pp. 511–522, New York, NY , USA,

2026

[3] [3]

doi: 10.1145/3774904.3792107

Association for Computing Machinery. doi: 10.1145/3774904.3792107. Chung, H.-H., Chaudhari, S., Han, X., Wald, Y ., Saria, S., and Ghosh, J. Between linear and sinusoidal: Rethink- ing the time encoder in dynamic graph learning.arXiv preprint arXiv:2504.08129,

work page doi:10.1145/3774904.3792107

[4] [4]

Do we really need compli- cated model architectures for temporal networks?arXiv preprint arXiv:2302.11636,

Cong, W., Zhang, S., Kang, J., Yuan, B., Wu, H., Zhou, X., Tong, H., and Mahdavi, M. Do we really need compli- cated model architectures for temporal networks?arXiv preprint arXiv:2302.11636,

work page arXiv

[5] [5]

Gaussian Error Linear Units (GELUs)

Hendrycks, D. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Graph neural network for higher-order dependency networks

Jin, D., Gong, Y ., Wang, Z., Yu, Z., He, D., Huang, Y ., and Wang, W. Graph neural network for higher-order dependency networks. InProceedings of the ACM Web Conference 2022, pp. 1622–1630,

2022

[7] [7]

Deep Learning with Dynamic Computation Graphs

9 Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs Looks, M., Herreshoff, M., Hutchins, D., and Norvig, P. Deep learning with dynamic computation graphs.arXiv preprint arXiv:1702.02181,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

H., Lee, J

Nguyen, G. H., Lee, J. B., Rossi, R. A., Ahmed, N. K., Koh, E., and Kim, S. Continuous-time dynamic network embeddings. InCompanion proceedings of the web con- ference 2018, pp. 969–976,

2018

[9] [9]

RWKV: Reinventing RNNs for the Transformer Era

Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Rwkv-7” goose” with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,

Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., Lin, J., Liu, J., Lu, J., Merrill, W., et al. Rwkv-7” goose” with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,

work page arXiv

[11] [11]

Temporal Graph Networks for Deep Learning on Dynamic Graphs

Rossi, E., Chamberlain, B., Frasca, F., Eynard, D., Monti, F., and Bronstein, M. Temporal graph networks for deep learning on dynamic graphs.arXiv preprint arXiv:2006.10637,

work page internal anchor Pith review Pith/arXiv arXiv 2006

[12] [12]

Retentive Network: A Successor to Transformer for Large Language Models

Sun, Y ., Dong, L., Huang, S., Ma, S., Xia, Y ., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Graph Attention Networks

Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y . Graph attention networks.arXiv preprint arXiv:1710.10903,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Tcl: Transformer- based dynamic graph modelling via contrastive learning

Wang, L., Chang, X., Li, S., Chu, Y ., Li, H., Zhang, W., He, X., Song, L., Zhou, J., and Yang, H. Tcl: Transformer- based dynamic graph modelling via contrastive learning. arXiv preprint arXiv:2105.07944, 2021a. Wang, Y ., Chang, Y .-Y ., Liu, Y ., Leskovec, J., and Li, P. Inductive representation learning in temporal net- works via causal anonymous walk...

work page arXiv

[15] [15]

Inductive representation learning on temporal graphs

Xu, D., Ruan, C., Korpeoglu, E., Kumar, S., and Achan, K. Inductive representation learning on temporal graphs. arXiv preprint arXiv:2002.07962,

work page arXiv 2002

[16] [16]

An attention free transformer

Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R., and Susskind, J. An attention free transformer. arXiv preprint arXiv:2105.14103,

work page arXiv

[17] [17]

11 Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs Inductive step.Assume that Equation (15) holds forℓ−1at all time momentsτ 1, . . . , τn. We show it also holds forℓ. For the case1≤q≤ℓ−1(i.e.,|T τq |< ℓ), from Equation (16) we get A(ℓ) τq = X τ∈T τq A(ℓ−1) τ − Tτ = 0,(22) sinceA (ℓ−1) τ − = 0whenever|T τ − |< ...

2015

[18] [18]

recovers exact temporal walk counts, while the state dynamics in (4) with learnable(a t, bt)introduce adaptive temporal weighting that balances recent versus historical information. A.2. Proof of Theorem 3.1 We recall that the aggregated increments in Equation (7) are uniformly bounded under standard architectural constraints: • Bounded node features:Inpu...

work page arXiv 1980

[19] [19]

For baseline models, we use the configurations and hyperparameters reported in Yu et al

All other hyperparameters follow the global training configuration, including a learning rate of 10−4, batch size of 200, early stopping with a patience of 10, and the Adam optimizer. For baseline models, we use the configurations and hyperparameters reported in Yu et al. (2023) and Lu et al. (2024), which we verified through independent validation. C.4. ...

2023

[20] [20]

reveals clear statistically significant gaps between DSRD and the majority of baselines. As visually indicated by the horizontal bars in Figures 6 and 7, DSRD is not connected to most competing methods, confirming that its performance advantage is statistically significant at the 95% confidence level. Notably, the separation between DSRD and lower-ranked ...

2093

[21] [21]

C.9. Detailed Analysis of Adaptive Decay Behaviors This section provides a comprehensive analysis of the learned decay parameters across all 14 datasets, complementing the summary in Section 4.3. By examining the correlation between learned parameters and dataset properties in Table 3, we reveal how DSRD automatically adapts to diverse dynamic regimes. Lo...

2000

[22] [22]

We can observe that removing the entire DSRD block leads to substantial degradation across all datasets, particularly on high-density graphs such as Enron and Social Evo., confirming that the retentive state mechanism is essential for capturing complex interaction patterns. Disabling adaptive temporal decay causes notable drops on discrete-time graphs wit...

2024

[23] [23]

Such simplifications not only improve training speed and stability, but also reduce model complexity, which can help mitigate overfitting to idiosyncratic temporal patterns

or RNN modules (Cong et al., 2023). Such simplifications not only improve training speed and stability, but also reduce model complexity, which can help mitigate overfitting to idiosyncratic temporal patterns. For instance, scalability-focused methods like NAT introduce specialized data structures (the N-cache) and neighborhood sampling techniques to hand...

2023