pith. machine review for the scientific record.

arxiv: 2604.22781 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


BiTA: Bidirectional Gated Recurrent Unit-Transformer Aggregator in a Temporal Graph Network Framework for Alert Prediction in Computer Networks

Mohsen Rezvani, Zahra Makki Nayeri

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords alert prediction · temporal graph networks · bidirectional GRU · transformer aggregator · cyber threat detection · TGN framework · network intrusion · dynamic graphs

The pith

BiTA redesigns the temporal aggregation step inside TGNs by combining bidirectional GRU sequential encoding with Transformer long-range context over each node's history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BiTA to fix a limitation in existing temporal graph networks for network alert prediction. Current TGN models use only unidirectional or single-mechanism aggregation, which misses the recursive, multi-scale timing patterns typical of real attacks. BiTA keeps the rest of the TGN memory and message-passing unchanged but replaces the aggregator with a joint bidirectional GRU-Transformer block that processes both forward-backward sequences and distant relations in the same temporal neighborhood. Experiments on real alert logs show higher AUC, average precision, mean reciprocal rank, and per-category accuracy in both transductive and inductive regimes.

Core claim

BiTA redesigns the temporal aggregation function within the TGN framework by jointly encoding bidirectional sequential dependencies and long-range contextual relations over each node's temporal neighborhood, enabling complementary temporal reasoning at different scales while preserving the original TGN memory and message-passing structure. On real-world alert datasets the method yields measurable gains in AUC, average precision, mean reciprocal rank, and per-category accuracy versus prior TGN variants under both transductive and inductive evaluation.

What carries the argument

Bidirectional Gated Recurrent Unit-Transformer Aggregator (BiTA), which processes each node's temporal neighborhood by running a bidirectional GRU for sequential order and a Transformer for distant relations in one step.
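The shape of that aggregation step can be sketched in a few lines of NumPy. This is not the authors' implementation: the dimensions, random initialization, single-layer cells, and the concatenation readout are illustrative assumptions standing in for whatever the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # message/feature dimension (illustrative)

def gru_params(d):
    # one weight matrix per gate (update z, reset r, candidate n), acting on [h; x]
    return {g: rng.normal(0, 0.1, (d, 2 * d)) for g in ("z", "r", "n")}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_pass(seq, W):
    """Single-layer GRU over a sequence of event messages; returns all hidden states."""
    h = np.zeros(D)
    out = []
    for x in seq:
        hx = np.concatenate([h, x])
        z = sigmoid(W["z"] @ hx)
        r = sigmoid(W["r"] @ hx)
        n = np.tanh(W["n"] @ np.concatenate([r * h, x]))
        h = (1 - z) * h + z * n  # gated state update
        out.append(h)
    return np.stack(out)

def self_attention(seq):
    """Scaled dot-product self-attention over the neighborhood (all events are past, <= t)."""
    X = np.stack(seq)
    scores = X @ X.T / np.sqrt(D)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ X

# a node's temporal neighborhood: the k most recent messages, all with timestamp <= t
msgs = [rng.normal(size=D) for _ in range(10)]

Wf, Wb = gru_params(D), gru_params(D)
fwd = gru_pass(msgs, Wf)              # oldest -> newest within the window
bwd = gru_pass(msgs[::-1], Wb)[::-1]  # newest -> oldest, re-aligned to msgs order
attn = self_attention(msgs)           # direct long-range relations within the window

# one aggregated message per node: both sequential views plus the attention summary
agg = np.concatenate([fwd[-1], bwd[0], attn.mean(axis=0)])
print(agg.shape)  # (48,)
```

The sketch makes the claimed complementarity concrete: the two GRU passes encode order within the window, while the attention step relates distant events in the same window directly, without passing through intermediate states.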

If this is right

  • The same TGN memory and message-passing backbone can support richer temporal reasoning without architectural overhaul.
  • Attack prediction improves on both seen and previously unseen nodes and edges.
  • The approach remains computationally scalable for real-time use because it reuses the original TGN structure.
  • Per-category accuracy rises, allowing more precise identification of distinct threat types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar bidirectional-plus-context aggregators could be swapped into other temporal graph tasks such as user behavior modeling or traffic forecasting.
  • The design suggests a general pattern for upgrading any memory-based temporal model by adding a parallel long-range attention path.
  • If the performance lift holds on larger, noisier logs it could reduce the need for deeper or wider networks in intrusion detection pipelines.

Load-bearing premise

Jointly encoding bidirectional sequences and long-range relations through GRU plus Transformer will capture the multi-scale recursive timing of attacks more reliably than unidirectional or single-mechanism aggregators.

What would settle it

An ablation study on the same alert datasets that removes either the bidirectional GRU or the Transformer component: if the reduced model shows no drop in AUC or average precision relative to the full BiTA, the premise fails; a clear drop from removing each component would support it.
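That decisive comparison can be sketched with synthetic scores standing in for the full and ablated models, and a standard rank-based AUC (Mann-Whitney statistic) in place of the paper's evaluation code:

```python
import numpy as np

def auc_score(labels, scores):
    """Rank-based AUC (Mann-Whitney U). Assumes both classes present; ties unhandled,
    which is fine for continuous scores."""
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)

# Hypothetical score distributions: the premise predicts the full model separates
# classes better than either ablated variant.
variants = {
    "full BiTA":      labels + rng.normal(0, 0.8, 1000),
    "no bi-GRU":      labels + rng.normal(0, 1.2, 1000),
    "no Transformer": labels + rng.normal(0, 1.2, 1000),
}
for name, scores in variants.items():
    print(f"{name:15s} AUC = {auc_score(labels, scores):.3f}")
```

On real runs, the verdict is the gap between the first line and the other two: no gap refutes the load-bearing premise, a consistent gap supports it.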

Figures

Figures reproduced from arXiv: 2604.22781 by Mohsen Rezvani, Zahra Makki Nayeri.

Figure 1. Overview of the proposed BiTA module integrated into TGN.
Figure 2. Sample of a graph-based representation of alert data.
Figure 3. The graph evolution in time.
Figure 4. Example of the BiTA framework for link prediction and category prediction.
Figure 5. Distribution of attack intervals.
Figure 6. Attack accumulation over time.
Figure 7. BiTA comparison with baselines, ROC curve for new nodes: (a) TPR vs. FPR, (b) precision-recall.
Figure 8. BiTA comparison with baselines, ROC curve: (a) TPR vs. FPR, (b) precision-recall.
Figure 9. Comparison of metrics for multiple time windows.
Figure 10. Comparison of metrics for baseline approaches.
Figure 11. Comparison of metrics for baseline and graph-based approaches.
Figure 12. Comparison of metrics for multiple approaches over time.
Figure 13. Metrics comparison for the NF-UNSW dataset based on multiple aggregators.
Figure 14. Comparison of metrics for baseline aggregators and the proposed aggregator (transductive setting).
Figure 15. Comparison of metrics for baseline aggregators and the proposed aggregator (inductive setting).
Figure 18. Comparison of metrics illustrating the influence of message sequence length.
Figure 19. Comparison of metrics based on the contribution of the Transformer component.
Figure 20. Improvement-percentage comparison of metrics based on the contribution of the Transformer component.
Figure 21. Temporal order sensitivity analysis. The caption's spilled text reports Δmax < 10⁻² and mean Δ ≈ 5.43 × 10⁻³, i.e., negligible deviation between the two configurations, which the paper takes as confirmation that the model respects temporal causality and does not exploit future information during prediction.
Figure 22. Training batch order sensitivity analysis.
Figure 23. ROC curve comparison of BiTA with SOTA methods: BiTA vs. TGN-Vanilla with mean aggregator.
Figure 24. ROC curve comparison of BiTA with SOTA baselines.
Figure 25. Inference time comparison of SOTA methods.
Figure 26. Comprehensive performance comparison of BiTA against SOTA baselines across multiple evaluation metrics.
Figure 27. Batch size scalability: impact of batch size on latency and throughput.
Figure 28. Graph size scalability with increasing graph size.
Figure 29. Latency distribution, demonstrating real-time inference capability.
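The Δmax and mean-Δ statistics quoted alongside Figure 21 are plain deviation measures between two prediction runs of the same model under different event orderings. A minimal sketch, with synthetic predictions rather than the paper's outputs:

```python
import numpy as np

def order_sensitivity(p_a, p_b):
    """Max and mean absolute deviation between predictions from two event orderings."""
    d = np.abs(np.asarray(p_a) - np.asarray(p_b))
    return d.max(), d.mean()

rng = np.random.default_rng(0)
p_ref = rng.uniform(size=500)                  # predictions, reference ordering
p_alt = p_ref + rng.normal(0, 2e-3, size=500)  # same model, permuted event stream
d_max, d_mean = order_sensitivity(p_ref, p_alt)
print(f"Δmax = {d_max:.2e}, mean Δ = {d_mean:.2e}")
```

Values near zero, as reported in the paper, are what a causality-respecting model should produce; large deviations would flag order-dependent (potentially leaky) behavior.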
read the original abstract

Proactive alert prediction in computer networks is critical for mitigating evolving cyber threats and enabling timely defensive actions. Temporal Graph Neural Networks (TGNs) provide a principled framework for modeling time-evolving interactions; however, existing TGN-based methods predominantly rely on unidirectional or single-mechanism temporal aggregation, which limits their ability to capture recursive, multi-scale temporal patterns commonly observed in real-world attack behaviors. In this paper, we propose BiTA, a Bidirectional Gated Recurrent Unit-Transformer Aggregator for temporal graph learning. Rather than introducing a deeper or higher-capacity model, BiTA redesigns the temporal aggregation function within the TGN framework by jointly encoding bidirectional sequential dependencies and long-range contextual relations over each node's temporal neighborhood. This aggregation strategy enables complementary temporal reasoning at different scales while preserving the original TGN memory and message-passing structure. We evaluate BiTA on real-world alert datasets, demonstrating significant improvements in key performance metrics such as area under the curve, average precision, mean reciprocal rank, and per-category prediction accuracy when compared to state-of-the-art temporal graph models. BiTA outperforms baseline methods under both transductive and inductive settings, highlighting its robustness and generalization capabilities in dynamic network environments. BiTA is a scalable and interpretable framework for real-time cyber threat anticipation, paving the way toward more intelligent and adaptive intrusion detection systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes BiTA, a Bidirectional Gated Recurrent Unit-Transformer Aggregator integrated into the Temporal Graph Network (TGN) framework for proactive alert prediction in computer networks. It redesigns the temporal aggregation function to jointly encode bidirectional sequential dependencies and long-range contextual relations over each node's temporal neighborhood, enabling complementary multi-scale temporal reasoning while preserving the original TGN memory and message-passing structure. The approach is evaluated on real-world alert datasets and claims significant improvements in AUC, average precision, mean reciprocal rank, and per-category accuracy over state-of-the-art temporal graph models under both transductive and inductive settings.

Significance. If the performance gains are genuine and the aggregator maintains strict temporal causality, BiTA offers a practical architectural refinement for modeling recursive, multi-scale temporal patterns in dynamic network graphs. This could strengthen proactive cyber threat anticipation in intrusion detection systems by improving upon unidirectional or single-mechanism aggregators without requiring changes to core TGN components, potentially aiding adoption in real-time security applications.

major comments (1)
  1. BiTA Aggregator (description of the joint bidirectional GRU-Transformer encoding): The bidirectional GRU risks violating temporal causality, as the reverse pass can incorporate events with timestamps > t unless explicitly masked. The abstract states that BiTA 'preserves the original TGN memory and message-passing structure,' but provides no details on time-aware masking or causal enforcement in the aggregator. This is load-bearing for the central claim, because any reported gains in AUC/AP/MRR could stem from non-causal leakage rather than improved capture of attack dynamics. Please supply the exact forward/reverse pass equations, masking implementation, or pseudocode to confirm that no future information influences the state at time t.
minor comments (2)
  1. Abstract: The claim of 'significant improvements' in key metrics is stated without any numerical values, dataset sizes, or baseline comparisons; adding the top-line results (e.g., AUC deltas) would make the summary self-contained.
  2. Evaluation: Confirm that all reported metric improvements include error bars or statistical tests across multiple runs to substantiate outperformance claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The concern about temporal causality in the BiTA aggregator is important, and we address it directly below with a commitment to strengthen the manuscript.

read point-by-point responses
  1. Referee: BiTA Aggregator (description of the joint bidirectional GRU-Transformer encoding): The bidirectional GRU risks violating temporal causality, as the reverse pass can incorporate events with timestamps > t unless explicitly masked. The abstract states that BiTA 'preserves the original TGN memory and message-passing structure,' but provides no details on time-aware masking or causal enforcement in the aggregator. This is load-bearing for the central claim, because any reported gains in AUC/AP/MRR could stem from non-causal leakage rather than improved capture of attack dynamics. Please supply the exact forward/reverse pass equations, masking implementation, or pseudocode to confirm that no future information influences the state at time t.

    Authors: We agree that explicit causal enforcement must be demonstrated. In BiTA, the temporal neighborhood for each node at time t consists solely of events with timestamps ≤ t. The bidirectional GRU processes this sequence as follows: the forward GRU pass iterates from the oldest to the newest event before t; the reverse GRU pass iterates from the newest event before t backwards to the oldest, with no access to any future events. The Transformer self-attention is likewise restricted to this same causal sequence via a lower-triangular mask. We will add the exact forward/reverse GRU update equations, the masking procedure, and pseudocode in the revised Section 3.2 to make this explicit and rule out leakage. revision: yes
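The causal-enforcement mechanics the rebuttal describes, filtering the neighborhood to events with timestamp ≤ t and masking attention to a lower triangle, can be sketched as follows (illustrative code, not the authors' promised Section 3.2):

```python
import numpy as np

def causal_neighborhood(timestamps, t):
    """Indices of events usable at prediction time t: strictly no future events."""
    timestamps = np.asarray(timestamps)
    return np.nonzero(timestamps <= t)[0]

def masked_attention_weights(scores):
    """Lower-triangular (causal) softmax: position i attends only to j <= i."""
    n = scores.shape[0]
    keep = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(keep, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# the event at 9.9 is excluded from the neighborhood at t = 4.2
ts = [1.0, 2.5, 3.0, 4.2, 9.9]
idx = causal_neighborhood(ts, t=4.2)
print(idx)

# no attention weight ever lands on a future position
rng = np.random.default_rng(0)
W = masked_attention_weights(rng.normal(size=(4, 4)))
print(np.triu(W, k=1).max())  # 0.0
```

With both filters in place, neither the reverse GRU pass (which only re-reads the filtered window) nor the attention step can see past t, which is exactly the leakage the referee's major comment asks the authors to rule out.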

Circularity Check

0 steps flagged

No circularity: architectural redesign with independent evaluation

full rationale

The paper proposes BiTA as a redesign of the temporal aggregation function inside the existing TGN framework, combining bidirectional GRU and Transformer components to capture multi-scale patterns while preserving the original memory and message-passing structure. No equations, derivations, or parameter-fitting steps are described that reduce the claimed performance gains (AUC, AP, MRR) to fitted inputs or self-referential quantities by construction. The contribution is evaluated empirically on real-world alert datasets against external baselines under transductive and inductive settings, with no load-bearing self-citations or uniqueness theorems invoked to justify the architecture. The derivation chain is therefore self-contained and does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central addition is the architectural choice of bidirectional GRU-Transformer aggregation.

pith-pipeline@v0.9.0 · 5551 in / 1142 out tokens · 41612 ms · 2026-05-13T19:57:34.956033+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. Ansari, M. S., Bartos, V., & Lee, B. (2020). Shallow and deep learning approaches for network intrusion alert prediction. Procedia Computer Science, 171, 644–653.
  2. Ansari, M. S., Bartoš, V., & Lee, B. (2022). GRU-based deep learning approach for network intrusion alert prediction. Future Generation Computer Systems, 128, 235–247. URL: https://www.researchgate.net/publication/355133237
  3. Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
  4. Chen, C., Geng, H., Yang, N., Yang, X., & Yan, J. (2024). EasyDGL: Encode, train and interpret for continuous-time dynamic graph learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  5. Fu, C., Pei, W., Cao, Q., Zhang, C., Zhao, Y., Shen, X., & Tai, Y.-W. (2019). Non-local recurrent neural memory for supervised sequence modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6311–6320).
  6. Huang, Q., Yan, X., Wang, X., Rao, S. X., Han, Z., Fu, F., Zhang, W., & Jiang, J. (2024). Retrofitting temporal graph neural networks with transformer. arXiv preprint arXiv:2409.05477.
  7. Kacha, P., Kostenec, M., & Kropacova, A. (2015). Warden 3: Security event exchange redesign. In Proceedings of the 19th International Conference on Computers: Recent Advances in Computer Science.
  8. Kearney, P., Abdelsamea, M., Schmoor, X., Shah, F., & Vickers, I. (2023). Combating alert fatigue in the security operations centre. Available at SSRN 4633965.
  9. Nayeri, Z. M., & Rezvani, M. (2024). Alert prediction in computer networks using deep graph learning. In 2024 10th International Conference on Signal Processing and Intelligent Systems (ICSPIS) (pp. 1–5). IEEE.
  10. Nayeri, Z. M., & Rezvani, M. (2026). Alert prediction in computer networks using transformer-based temporal graph neural networks: Identifying the next victim. Journal of Network and Computer Applications, (p. 104455).
  11. Oguntoyinbo, M. (2025). Mitigating the risk as SOC alert analyst and incident responder.
  12. Peng, J., Wei, Z., & Ye, Y. (2025). TIDFormer: Exploiting temporal and interactive dynamics makes a great dynamic graph transformer. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (pp. 2245–2256).
  13. Poursafaei, F., Huang, S., Pelrine, K., & Rabbany, R. (2022). Towards better evaluation for dynamic link prediction. Advances in Neural Information Processing Systems, 35, 32928–32941.
  14. Riebe, T., Wirth, T., Bayer, M., Kühn, P., Kaufhold, M.-A., Knauthe, V., Guthe, S., & Reuter, C. (2021). CySecAlert: An alert generation system for cyber security events using open source intelligence data. In International Conference on Information and Communications Security (pp. 429–446). Springer.
  15. Rossi, E., Chamberlain, B., Frasca, F., Eynard, D., Monti, F., & Bronstein, M. (2020). Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.
  16. Siyan, A., & Sans, M. (2024). Machine learning in cyber security: Enhancing SOC operations with predictive analytics.
  17. Trivedi, R., Farajtabar, M., Biswal, P., & Zha, H. (2019). DyRep: Learning representations over dynamic graphs. In International Conference on Learning Representations.
  18. Varghese, A. J., Bora, A., Xu, M., & Karniadakis, G. E. (2024). TransformerG2G: Adaptive time-stepping for learning temporal graph embeddings using transformers. Neural Networks, 172, 106086.
  19. Wu, Z., Pan, S., Long, G., Jiang, J., & Zhang, C. (2019). Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.
  20. Xu, D., Ruan, C., Korpeoglu, E., Kumar, S., & Achan, K. (2020). Inductive representation learning on temporal graphs. arXiv preprint arXiv:2002.07962.