pith. machine review for the scientific record. sign in

arxiv: 2605.02075 · v1 · submitted 2026-05-03 · 💻 cs.NI

Recognition: 3 theorem links

· Lean Theorem

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

Alejandra Beghelli, Laura Toni, Michael Doherty

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:42 UTC · model grok-4.3

classification 💻 cs.NI
keywords reinforcement learninggraph transformersdynamic RMSAelastic optical networksrouting and spectrum allocationnetwork optimizationlarge-scale simulation
0
0 comments X

The pith

Stabilized reinforcement learning enables the first graph transformer to exceed benchmarks on large-scale dynamic routing and spectrum allocation in optical networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that recent machine learning techniques can overcome the training instabilities that previously prevented transformers from being used with reinforcement learning for dynamic RMSA. By pairing rotary positional encodings suited to graph data, off-policy invalid action masking, and valid mass regularization with fast GPU simulation, the authors achieve stable policy training on networks as large as 143 nodes. The resulting model supports up to 13 percent more traffic load than prior RL and heuristic methods while maintaining low blocking rates, and it scales to real topologies from the TopologyBench database. This demonstration includes ablation experiments that isolate the contribution of each stabilization component and an analysis of the allocation patterns the trained policy learns.

Core claim

The central discovery is that a graph transformer policy, trained end-to-end with stabilized off-policy reinforcement learning inside a GPU-accelerated simulator, produces the first RL agent for dynamic RMSA that outperforms all published benchmarks, increasing the supportable traffic load by as much as 13 percent and by 4 percent at blocking probabilities below 0.1 percent on networks up to 143 nodes and 362 links.

What carries the argument

A graph transformer policy whose training is stabilized by rotary positional encodings on graph-structured observations, off-policy invalid action masking, and valid mass regularization, executed inside a GPU-accelerated network simulator.

Load-bearing premise

The GPU simulator accurately reproduces the dynamics, physical impairments, and traffic patterns of real elastic optical networks so that policies learned in simulation transfer to unseen traffic and topologies.

What would settle it

Run the trained policy on a live or emulated optical network using traffic traces recorded from an operational network different from those used in training and measure whether the reported capacity gains and blocking rates are reproduced.

Figures

Figures reproduced from arXiv: 2605.02075 by Alejandra Beghelli, Laura Toni, Michael Doherty.

Figure 1
Figure 1. Figure 1: Overview of the XLRON training architecture and algorithm. Parallel environments (green) on GPU generate experience from actions selected by the agent (blue), which comprises a Graph Transformer agent (light blue) trained with stabilized PPO. Key components of the learning algorithm (purple) include off-policy invalid action masking, valid mass stabilization, and WiRE posi￾tional encodings (red). L(θ) = Eˆ… view at source ↗
Figure 2
Figure 2. Figure 2: Wavelet-Induced Rotary Encoding (WiRE) for injecting graph positional information into the transformer. The net￾work topology is converted to a line graph, spectral features are extracted from the Laplacian eigenvectors, and rotary po￾sition encodings are applied to the query and key vectors in each attention head. the bottom left view at source ↗
Figure 3
Figure 3. Figure 3: Actor-critic Graph Transformer architecture. The ac￾tor uses path pooling (concatenation of min, mean, max over path tokens) to rank candidate lightpaths, while the critic uses learned attention pooling over all tokens to estimate state value. H. Pooling and Readout Converting the transformed output tokens of the transformer into an action or value estimate requires an aggregation or pool￾ing step followed… view at source ↗
Figure 4
Figure 4. Figure 4: Service blocking probability as a function of traffic load for our method (Transformer RL) compared to previous RL meth￾ods (Deep/Reward/GCN-RMSA, MaskRSA, PtrNet-RSA), the best heuristic in each, and upper bound network capacity estimates (defragmentation bound and cut-set bound) across NSFNET, COST239, USNET, and JPN48 topologies. Shaded regions indicate the standard error of the mean across parallel env… view at source ↗
Figure 5
Figure 5. Figure 5: The (a) TataInd and (b) USA100 network topologies used in the large-scale experiments. Node labels indicate node IDs and edge labels indicate link lengths in km. A. Heuristic Benchmarks For the TataInd and USA100 topologies, we systematically de￾termine the strongest heuristic benchmark. Previous work [2] showed that FF-KSP and KSP-FF heuristics provided the low￾est blocking probability compared to five ot… view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study of key training components on TataInd and USA100 topologies. Each curve removes one component from the full model (All Features). The FF-KSP heuristic benchmark is shown for reference. Shaded regions indicate the upper and lower interquartile range across parallel environments. Removing the valid mass loss entirely (“No VML”) has dif￾fering effects in each case: on USA100 it causes a late tr… view at source ↗
Figure 7
Figure 7. Figure 7: Decomposition of the scalar magnitude of the total training loss into its constituent components (actor, valid mass, value, and entropy losses) over the course of training for TataInd and USA100 view at source ↗
Figure 8
Figure 8. Figure 8: Service blocking probability as a function of traffic load for the Transformer agent and FF-KSP heuristic on TataInd and USA100. Shaded regions indicate the standard error of the mean across parallel environments. sions without systematically preferring shorter or longer paths. On both topologies, the delta does not change appreciably as the network fills, suggesting a learned structural preference rather … view at source ↗
Figure 10
Figure 10. Figure 10: Mean path length in km (top) and hops (bottom) over the course of a single evaluation episode for the Transformer agent and FF-KSP heuristic on TataInd and USA100 view at source ↗
Figure 11
Figure 11. Figure 11: Difference in assigned path length (Transformer mi￾nus FF-KSP) in km (top) and hops (bottom) per traffic request for TataInd and USA100. Shaded regions indicate where one method selects shorter paths. natural to extend it to a multi-objective setting [44] or joint opti￾mization of parameters such as launch power, where we expect its advantages to be even more pronounced. Having established that our method… view at source ↗
Figure 14
Figure 14. Figure 14: Difference in link usage between Transformer and FF-KSP for each link in TataInd and USA100. Positive values (green) indicate links that have more requests that use them allocated by the Transformer; negative values (purple) indicate links used less. “Rotary Position Encodings for Graphs,” (2026). ArXiv:2509.22259 [cs]. 10. R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, … view at source ↗
Figure 13
Figure 13. Figure 13: Difference in frequency slot unit (FSU) occupancy be￾tween FF-KSP and Transformer across all links for TataInd and USA100. Green indicates higher occupancy by FF-KSP; purple indicates higher occupancy by the Transformer. (EP/R035342/1) REFERENCES 1. B. Jaumard, A. Mohammed, and Q. A. Nguyen, “Decomposition Mod￾els for the Routing and Slot Provisioning Problem,” in 2023 Interna￾tional Conference on Computi… view at source ↗
read the original abstract

Reinforcement learning (RL) has been widely applied to dynamic routing, modulation and spectrum assignment (RMSA) in optical networks, yet no prior work has trained a transformer model for this task. We attribute this to the high data and compute requirements of transformers and potential training instabilities with RL. We address this gap by combining recent advances from the machine learning literature (rotary positional encodings for graph-structured data, off-policy invalid action masking, and valid mass regularization) with GPU-accelerated simulation to achieve, for the first time, stable RL training of a transformer for dynamic RMSA. We demonstrate, through systematic benchmarking against previous RL methods and heuristic algorithms, that ours is the first RL method to exceed all benchmarks, increasing the supportable traffic load by up to 13\%. To demonstrate the scalability of our approach, we train on real network topologies from the TopologyBench database up to 143 nodes and 362 links, with 320 x 12.5\,GHz frequency slot units per link, and 100\,Gbps traffic requests. To our knowledge, these are the largest dynamic RMSA problems to which RL has been applied. We find up to 4\% increased traffic load can be supported at low blocking probability (<0.1\%) with our method compared to the best available benchmark algorithm. We present an ablation study of the components of our training algorithm, the dynamics of the loss function during training, and analyze the allocation decisions of the trained models. We make all code used to produce this paper openly available for reproduction and future benchmarking: https://github.com/micdoh/XLRON.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a graph transformer architecture trained via stabilized reinforcement learning (using rotary positional encodings, off-policy invalid action masking, and valid mass regularization) for dynamic routing, modulation, and spectrum allocation (RMSA) in elastic optical networks. It claims to be the first RL method to successfully train such a transformer, achieving stable training at scale on real topologies up to 143 nodes and 362 links with 320 frequency slots per link. Systematic benchmarking against prior RL methods and heuristics shows up to 13% higher supportable traffic load (and 4% at blocking probability <0.1%), supported by ablation studies, loss dynamics, decision analysis, and fully open code.

Significance. If the results hold, the work is significant for demonstrating that graph transformers can be stably trained with RL for large-scale dynamic network optimization problems previously limited by data/compute requirements and instability. The open code, ablation study, and analysis of allocation decisions provide concrete strengths for reproducibility and extension. The scalability to 143-node instances from TopologyBench is a notable advance over prior RL applications in this domain.

major comments (2)
  1. [Abstract and Evaluation sections] The headline performance claims (up to 13% increase in supportable traffic load) rest entirely on results from the authors' custom GPU-accelerated simulator. No hardware-in-the-loop measurements, validation against independent EON emulators, or external traffic traces are reported, which is load-bearing because any simulator-specific artifacts in modeling nonlinearities, ASE, crosstalk, or 12.5 GHz slot constraints could inflate the reported advantage over benchmarks.
  2. [Evaluation and Generalization tests] While held-out traffic and TopologyBench topologies are used for testing, the training distribution (uniform requests, fixed load ranges) may still overlap substantially with test conditions, leaving open whether the transformer policy learns robust principles or exploits simulator artifacts. This directly affects the generalization claim for unseen traffic and topologies.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly quantify the hyper-parameter sensitivity of the RL training (learning rate, entropy coefficient, etc.) and how the stabilization techniques mitigate it, to better contextualize the training stability contribution.
  2. [Results figures and tables] Figure captions and table legends should clarify whether reported blocking probabilities and load values are averaged over multiple random seeds or single runs, given the known variance in RL policies.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their detailed and constructive review of our manuscript. We address each major comment point-by-point below, providing honest responses based on the scope and limitations of the work. Where appropriate, we indicate revisions that will be incorporated into the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Evaluation sections] The headline performance claims (up to 13% increase in supportable traffic load) rest entirely on results from the authors' custom GPU-accelerated simulator. No hardware-in-the-loop measurements, validation against independent EON emulators, or external traffic traces are reported, which is load-bearing because any simulator-specific artifacts in modeling nonlinearities, ASE, crosstalk, or 12.5 GHz slot constraints could inflate the reported advantage over benchmarks.

    Authors: We acknowledge that all reported performance results, including the up to 13% improvement in supportable traffic load, are obtained exclusively from our custom GPU-accelerated simulator. This simulator implements standard, literature-established models for key optical impairments such as nonlinear effects, ASE noise, crosstalk, and the 12.5 GHz frequency slot constraints. We agree that the lack of hardware-in-the-loop validation or cross-checks against independent EON emulators constitutes a genuine limitation, as we do not have access to physical testbeds. To address the concern directly, we have open-sourced the full simulator code (as stated in the manuscript) to enable community verification, and we will add an explicit discussion subsection in the revised Evaluation section outlining the simulator's modeling assumptions and potential artifacts. This represents the strongest honest response within the paper's focus on algorithmic and training innovations rather than hardware experimentation. revision: partial

  2. Referee: [Evaluation and Generalization tests] While held-out traffic and TopologyBench topologies are used for testing, the training distribution (uniform requests, fixed load ranges) may still overlap substantially with test conditions, leaving open whether the transformer policy learns robust principles or exploits simulator artifacts. This directly affects the generalization claim for unseen traffic and topologies.

    Authors: We thank the referee for raising this important point on generalization. Our evaluation protocol uses explicitly held-out traffic request sequences and network topologies drawn from TopologyBench that are never seen during training. Training employs uniform request generation across load ranges to encourage learning of general policies, while testing measures performance on these unseen instances. Nevertheless, we recognize that the underlying request arrival model remains the same, which could in principle allow the policy to exploit simulator-specific patterns rather than purely robust principles. To strengthen the generalization claims, we will incorporate additional experiments in the revised manuscript that employ non-uniform traffic distributions and further varied held-out topologies; preliminary internal checks indicate these yield consistent advantages, supporting that the learned policy captures allocation principles beyond simulator artifacts. revision: yes

standing simulated objections not resolved
  • Hardware-in-the-loop measurements, validation against independent EON emulators, or use of external traffic traces, as we lack access to physical elastic optical network testbeds and such resources.

Circularity Check

0 steps flagged

No circularity: empirical RL performance measured on held-out simulator instances and public topologies.

full rationale

The paper reports measured blocking probabilities and supportable traffic loads obtained by training a graph-transformer policy with stabilized RL inside a GPU simulator and evaluating on held-out traffic and TopologyBench graphs. No derivation chain reduces a claimed performance gain to a fitted constant, self-defined quantity, or load-bearing self-citation; the central result is an empirical comparison against baselines, with code released for reproduction. Standard ML generalization risks exist but do not constitute the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical training rather than new theoretical derivations. Free parameters are standard RL and model hyperparameters; axioms are conventional modeling assumptions in networking and RL.

free parameters (2)
  • RL hyperparameters (learning rate, discount factor, entropy coefficient)
    Chosen via experimentation to achieve stable convergence on the RMSA task.
  • Transformer architecture parameters (number of layers, attention heads, embedding dimension)
    Selected to balance expressivity and training stability for graph-structured network states.
axioms (2)
  • domain assumption The optical network state can be represented as a graph with nodes, links, and frequency slots for sequential decision making.
    Standard modeling choice in RMSA literature.
  • domain assumption A Markov decision process formulation captures the dynamics of traffic arrivals and blocking.
    Core assumption enabling RL application.

pith-pipeline@v0.9.0 · 5602 in / 1396 out tokens · 77090 ms · 2026-05-08T18:42:23.601462+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

44 extracted references · 19 canonical work pages · 4 internal anchors

  1. [1]

    Decomposition Mod- els for the Routing and Slot Provisioning Problem,

    B. Jaumard, A. Mohammed, and Q. A. Nguyen, “Decomposition Mod- els for the Routing and Slot Provisioning Problem, ” in 2023 Interna- tional Conference on Computing, Networking and Communications (ICNC), (2023), pp. 659–665

  2. [2]

    Rein- forcement learning for dynamic resource allocation in optical networks: hype or hope?

    M. Doherty , R. Matzner, R. Sadeghi, P . Bayvel, and A. Beghelli, “Rein- forcement learning for dynamic resource allocation in optical networks: hype or hope?” J. Opt. Commun. Netw. 17, D1 (2025)

  3. [3]

    Attention Is All Y ou Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All Y ou Need, ” (2017). Version Number: 7

  4. [4]

    T ransformer-pointer DRL model for static resource allocation problems in SDM-EONs,

    S. Chen, J. Wang, and M. Shigeno, “T ransformer-pointer DRL model for static resource allocation problems in SDM-EONs, ” J. Opt. Com- mun. Netw. 18, 315 (2026)

  5. [5]

    Francis Song, Jack W

    E. Parisotto, H. F . Song, J. W. Rae, R. Pascanu, C. Gulcehre, S. M. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury , M. M. Botvinick, N. Heess, and R. Hadsell, “Stabilizing T ransformers for Re- inforcement Learning, ” (2019). ArXiv:1910.06764 [cs]

  6. [6]

    XLRON: Accelerated Learning and Resource Allocation for Optical Networks,

    M. Doherty , “XLRON: Accelerated Learning and Resource Allocation for Optical Networks, ”https://github.com/micdoh/XLRON.git (2023)

  7. [7]

    XLRON: Accelerated Reinforcement Learning Environments for Optical Networks,

    M. Doherty and A. Beghelli, “XLRON: Accelerated Reinforcement Learning Environments for Optical Networks, ” in 2024 Optical Fiber Communications Conference and Exhibition (OFC), (2024), pp. 1–3

  8. [8]

    Podracer architectures for scalable Reinforce- ment Learning,

    M. Hessel, M. Kroiss, A. Clark, I. Kemaev, J. Quan, T . Keck, F . Viola, and H. van Hasselt, “Podracer architectures for scalable Reinforce- ment Learning, ” (2021). ArXiv:2104.06272 [cs]

  9. [9]

    Rotary Position Encodings for Graphs,

    I. Reid, A. Sehanobish, C. HÃ ˝ ufs, B. Mlodozeniec, L. Vulpius, F . Bar- bero, A. Weller, K. Choromanski, R. E. T urner, and P . VeliÄ koviÄ ˘G, Fig. 14. Difference in link usage between Transformer and FF-KSP for each link in TataInd and USA100. Positive values (green) indicate links that have more requests that use them allocated by the Transformer; ne...

  10. [10]

    On layer normalization in the transformer architecture, 2020

    R. Xiong, Y . Y ang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y . Lan, L. Wang, and T .-Y . Liu, “On Layer Normalization in the T rans- former Architecture, ” (2020). ArXiv:2002.04745 [cs]

  11. [11]

    DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks,

    X. Chen, B. Li, R. Proietti, H. Lu, Z. Zhu, and S. J. B. Y oo, “DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks, ” J. Light. T ech- nol. 37, 4155–4163 (2019)

  12. [12]

    Heuristic Reward De- sign for Deep Reinforcement Learning-Based Routing, Modulation and Spectrum Assignment of Elastic Optical Networks,

    B. T ang, Y .-C. Huang, Y . Xue, and W. Zhou, “Heuristic Reward De- sign for Deep Reinforcement Learning-Based Routing, Modulation and Spectrum Assignment of Elastic Optical Networks, ” IEEE Com- mun. Lett. 26, 2675–2679 (2022)

  13. [13]

    Deep Reinforcement Learning- Based Routing and Spectrum Assignment of EONs by Exploiting GCN and RNN for Feature Extraction,

    L. Xu, Y .-C. Huang, Y . Xue, and X. Hu, “Deep Reinforcement Learning- Based Routing and Spectrum Assignment of EONs by Exploiting GCN and RNN for Feature Extraction, ” J. Light. T echnol. 40, 4945–4955 (2022)

  14. [14]

    Mask RSA: End-T o-End Reinforcement Learning-based Routing and Spectrum Assignment in Elastic Optical Networks,

    M. Shimoda and T . T anaka, “Mask RSA: End-T o-End Reinforcement Learning-based Routing and Spectrum Assignment in Elastic Optical Networks, ” in2021 European Conference on Optical Communication (ECOC), (IEEE, Bordeaux, France, 2021), pp. 1–4

  15. [15]

    PtrNet-RSA: A Pointer Network-based QoT-aware Routing and Spectrum Assign- ment Scheme in Elastic Optical Networks,

    Y . Cheng, S. Ding, Y . Shao, and C.-K. Chan, “PtrNet-RSA: A Pointer Network-based QoT-aware Routing and Spectrum Assign- ment Scheme in Elastic Optical Networks, ” J. Light. T echnol. pp. 1–12 (2024)

  16. [16]

    T opology Bench: Systematic Graph Based Benchmarking for Core Optical Networks,

    R. Matzner, A. Ahuja, R. Sadeghi, M. Doherty , A. Beghelli, S. J. Savory , and P . Bayvel, “T opology Bench: Systematic Graph Based Benchmarking for Core Optical Networks, ” (2024). Version Number: 1

  17. [17]

    A multicast reinforcement learn- ing algorithm for WDM optical networks,

    P . Garcia, A. Zsigri, and A. Guitton, “A multicast reinforcement learn- ing algorithm for WDM optical networks, ” in Proceedings of the 7th In- ternational Conference on T elecommunications, 2003. ConTEL 2003., (IEEE, Zagreb, Croatia, 2003), pp. 419–426 vol.2

  18. [18]

    Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games,

    S. Huang and S. Ontañón, “Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games, ” (2020). ArXiv:2010.03956 [cs, stat]

  19. [19]

    Deep-reinforcement- learning-based RMSCA for space division multiplexing networks with multi-core fibers [Invited T utorial],

    Y . T eng, C. Natalino, H. Li, R. Y ang, J. Majeed, S. Shen, P . Monti, R. Nejabati, S. Y an, and D. Simeonidou, “Deep-reinforcement- learning-based RMSCA for space division multiplexing networks with multi-core fibers [Invited T utorial], ” J. Opt. Commun. Netw. 16, C76 (2024)

  20. [20]

    DRL-Assisted QoT-Aware Ser- vice Provisioning in Multi-Band Elastic Optical Networks,

    Y . T eng, C. Natalino, F . Arpanaei, H. Li, A. Sà ˛ anchez-Macià ˛ an, P . Monti, S. Y an, and D. Simeonidou, “DRL-Assisted QoT-Aware Ser- vice Provisioning in Multi-Band Elastic Optical Networks, ” J. Light. T echnol.43, 9090–9101 (2025)

  21. [21]

    Physical layer-aware deep reinforcement learning with advantage function stabilization for dynamic RMSA in elastic optical networks,

    H. Wang, Y . Wang, Y . Zhao, and J. Zhang, “Physical layer-aware deep reinforcement learning with advantage function stabilization for dynamic RMSA in elastic optical networks, ” J. Opt. Commun. Netw. 18, 250 (2026)

  22. [22]

    Proximal Policy Optimization Algorithms

    J. Schulman, F . Wolski, P . Dhariwal, A. Radford, and O. Klimov, “Prox- Research Article 13 imal Policy Optimization Algorithms, ” (2017). ArXiv:1707.06347 [cs]

  23. [23]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P . Moritz, S. Levine, M. Jordan, and P . Abbeel, “High- Dimensional Continuous Control Using Generalized Advantage Esti- mation, ” (2018). ArXiv:1506.02438 [cs]

  24. [24]

    Reward centering.arXiv preprint arXiv:2405.09999, 2024

    A. Naik, Y . Wan, M. T omar, and R. S. Sutton, “Reward Centering, ” (2024). ArXiv:2405.09999 [cs]

  25. [25]

    Overcoming Valid Action Suppression in Unmasked Pol- icy Gradient Algorithms,

    R. Zabounidis, R. Siegelmann, M. Qadri, W. Kim, S. Stepputtis, and K. P . Sycara, “Overcoming Valid Action Suppression in Unmasked Pol- icy Gradient Algorithms, ” (2026). ArXiv:2603.09090 [cs]

  26. [26]

    Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Com- parative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games,

    Y . Hou, X. Liang, J. Zhang, Q. Y ang, A. Y ang, and N. Wang, “Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Com- parative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games, ” Appl. Sci.13, 8283 (2023)

  27. [27]

    PonderNet: Learning to ponder.arXiv preprint arXiv:2106.01345,

    L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P . Abbeel, A. Srinivas, and I. Mordatch, “Decision T ransformer: Reinforcement Learning via Sequence Modeling, ” (2021). ArXiv:2106.01345 [cs]

  28. [28]

    Layer Normalization

    J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization, ” (2016). ArXiv:1607.06450 [cs, stat]

  29. [29]

    Graph Attention Networks

    P . VeliÄ koviÄ ˘G, G. Cucurull, A. Casanova, A. Romero, P . LiÚ, and Y . Bengio, “Graph Attention Networks, ” (2018). ArXiv:1710.10903 [stat]

  30. [30]

    How attentive are graph attention networks?arXiv preprint arXiv:2105.14491,

    S. Brody , U. Alon, and E. Y ahav, “How Attentive are Graph Attention Networks?” (2022). ArXiv:2105.14491 [cs]

  31. [31]

    Graph Attention Network En- hanced Deep Reinforcement Learning Framework for Routing, Mod- ulation, and Spectrum Allocation in EONs,

    Z. Xiong, Y .-C. Huang, and X. Hu, “Graph Attention Network En- hanced Deep Reinforcement Learning Framework for Routing, Mod- ulation, and Spectrum Allocation in EONs, ” in 2024 Asia Communica- tions and Photonics Conference (ACP) and International Conference on Information Photonics and Optical Communications (IPOC), (IEEE, Beijing, China, 2024), pp. 1–6

  32. [32]

    Do transformers really perform bad for graph representation?, 2021

    C. Ying, T . Cai, S. Luo, S. Zheng, G. Ke, D. He, Y . Shen, and T .-Y . Liu, “Do T ransformers Really Perform Bad for Graph Representation?” (2021). ArXiv:2106.05234 [cs]

  33. [33]

    Graph Inductive Biases in T ransformers without Message Passing,

    L. Ma, C. Lin, D. Lim, A. Romero-Soriano, P . K. Dokania, M. Coates, P . T orr, and S.-N. Lim, “Graph Inductive Biases in T ransformers without Message Passing, ” (2023). ArXiv:2305.17589 [cs]

  34. [34]

    Comparing Graph T ransformers via Positional Encodings,

    M. Black, Z. Wan, G. Mishne, A. Nayyeri, and Y . Wang, “Comparing Graph T ransformers via Positional Encodings, ” (2024). ArXiv:2402.14202 [cs]

  35. [35]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, and Y . Liu, “Ro- Former: Enhanced T ransformer with Rotary Position Embedding, ” (2023). ArXiv:2104.09864 [cs]

  36. [36]

    arXiv preprint arXiv:2502.11032 , year=

    S. Ennadir, M. Vazirgiannis, and R. Liao, “Pool me wisely: Rethinking graph pooling in graph transformers, ” (2025). ArXiv:2502.11032

  37. [37]

    Pointer Networks,

    O. Vinyals, M. Fortunato, and N. Jaitly , “Pointer Networks, ” (2015). Ver- sion Number: 2

  38. [38]

    Cost-effective network capac- ity upgrade by heterogeneous wavelength division multiplexing density with bandwidth-variable virtual direct links,

    K. Hayashi, Y . Mori, and H. Hasegawa, “Cost-effective network capac- ity upgrade by heterogeneous wavelength division multiplexing density with bandwidth-variable virtual direct links, ” J. Opt. Commun. Netw.15, D23–D32 (2023)

  39. [39]

    Effective Capacity Estimation Based on Cut-Set Load Analysis in Optical Path Networks,

    K. Cruzado, Y . Mori, S.-C. Lin, M. Matsuura, S. Subramaniam, and H. Hasegawa, “Effective Capacity Estimation Based on Cut-Set Load Analysis in Optical Path Networks, ” in 2023 International Conference on Photonics in Switching and Computing (PSC), (2023), pp. 1–3

  40. [40]

    Capacity-Bound Evaluation and Routing and Spec- trum Assignment for Elastic Optical Path Networks with Distance- Adaptive Modulation,

    K. Cruzado, Y . Mori, S.-C. Lin, M. Matsuura, S. Subramaniam, and H. Hasegawa, “Capacity-Bound Evaluation and Routing and Spec- trum Assignment for Elastic Optical Path Networks with Distance- Adaptive Modulation, ” in2024 Optical Fiber Communications Confer- ence and Exhibition (OFC), (2024), pp. 1–3

  41. [41]

    Routing and wavelength allocation in WDM optical net- works,

    S. Baroni, “Routing and wavelength allocation in WDM optical net- works, ” Ph.D. thesis, University College London, United Kingdom (1998)

  42. [42]

    Resource allocation and scalability in dynamic wavelength-routed optical networks,

    A. Beghelli, “Resource allocation and scalability in dynamic wavelength-routed optical networks, ” Ph.D. thesis, University of Lon- don (2006)

  43. [43]

    Staggered Environment Resets Im- prove Massively Parallel On-Policy Reinforcement Learning,

    S. Bharthulwar, S. T ao, and H. Su, “Staggered Environment Resets Im- prove Massively Parallel On-Policy Reinforcement Learning, ” (2025). ArXiv:2511.21011 [cs]

  44. [44]

    Interpreting multi-objective reinforcement learning for routing and wavelength assignment in optical networks,

    S. Nallaperuma, Z. Gan, J. Nevin, M. Shevchenko, and S. J. Sa- vory , “Interpreting multi-objective reinforcement learning for routing and wavelength assignment in optical networks, ” J. Opt. Commun. Netw. 15, 497 (2023)